Crab Sojourn.

No Where. Now Here.

Paper Critique 5


Paper: Automatic Gait Optimization with Gaussian Process Regression

Author: Daniel Lizotte

 

Overview

The authors propose a new approach to robot gait evaluation based on Gaussian processes. The motivation for adopting a Gaussian process seems to be improving on previous evolutionary approaches, which have three main drawbacks: they get stuck at local maxima, they throw away the history of past evaluations, and they do not model noise.

As the authors themselves note at the end, the model is quite simple: it makes strong assumptions and every parameter is set by hand. For a Bayesian model, the important thing is to learn those parameters rather than fix them manually.

 

Contribution

The authors compared the Gaussian process approach with the best-performing approach in the current literature and showed that theirs outperforms it. Beyond the raw performance, they claim that all three drawbacks are overcome, and this is their main contribution.

However, I am not sure it is fair to say that the authors have really fixed these "drawbacks":

1. Does the new approach really escape local maxima? The Gaussian process itself does not suffer from this problem. However, in the implementation details the authors say that "in order to compute the most probable improvement point" they used a hill climber. Hill climbing is, as we all know, prone to getting stuck at local maxima, so the new approach may end up in the same situation as the previous ones.

2. Regarding previous work, the authors say that once a robot dies out of the population, or once a gradient step is taken, previously evaluated gaits are forgotten. This seems to be an argument that the evaluation, or evolution, is not revertible. If I understand correctly, their new approach is also irreversible, so how can they claim to have improved on this drawback?

3. On the last point, they did indeed add a noise parameter to their model. However, the treatment is rather naive: the noise is a normal distribution whose variance is a fixed number, and they tried several values of that number to find the one giving the best overall performance. This feels like cheating, because they want to achieve the best result "with noise" while the noise level is not estimated but hand-picked. To my understanding, the noise should reflect the randomness of the real situation rather than being a chosen constant, so I do not think their way of adding noise counts as much of an improvement. (The authors themselves were not sure their choice of noise parameter performed well, as they say in Section 4.2: "…a value of ω² = 0.01, which SEEMED to work well in practice".)
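As an aside, here is how the noise variance could be treated as something to be learned rather than hand-set. This is only a minimal sketch of my own using scikit-learn on synthetic data, not the authors' setup: version (a) fixes the noise variance at 0.01 like the paper's ω², while version (b) lets a WhiteKernel estimate it by maximizing the marginal likelihood.

```python
# Minimal sketch (not the authors' code): hand-fixed noise variance vs. one
# learned by maximizing the GP marginal likelihood. The gait-score data is synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))                 # pretend gait parameters
y = np.sin(6 * X[:, 0]) + rng.normal(0, 0.1, 30)    # pretend noisy gait scores

# (a) Fixed noise variance, like the paper's omega^2 = 0.01
gp_fixed = GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                                    alpha=0.01)     # noise level set by hand
gp_fixed.fit(X, y)

# (b) Noise variance treated as a hyperparameter and learned from the data
kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=1.0)
gp_learned = GaussianProcessRegressor(kernel=kernel, alpha=0.0)
gp_learned.fit(X, y)
print(gp_learned.kernel_)   # the fitted WhiteKernel reports the inferred noise level
```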

Therefore there is still a lot of work to do before they can really claim to have made big progress.

 

Algorithm

There is not much to say about the idea of applying a Gaussian process in this model; it is quite natural when there are several parameters and one wants to find the optimal setting.

However, the authors do not explain in detail how they used the Gaussian process, so I am not sure they used it correctly. The method seems to rest largely on the search in Eq. (4): they look for the point that maximizes the probability of improvement, and Section 4.3 says this search is done with a "hill climber in MATLAB". This is the part that concerns me.
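To make that concern concrete, below is a minimal sketch (my own, not the authors' MATLAB code) of maximizing a probability-of-improvement acquisition over a fitted GP with a naive hill climber; it assumes a scikit-learn-style GP whose predict supports return_std=True. Without random restarts, such a climber can easily return a local maximum of the acquisition surface, which is exactly the worry raised above.

```python
# Minimal sketch (not the authors' code): probability-of-improvement acquisition
# maximized with a naive coordinate hill climber plus random restarts.
import numpy as np
from scipy.stats import norm

def prob_improvement(mu, sigma, best_y):
    """P(f(x) > best_y) under the GP posterior N(mu, sigma^2)."""
    return norm.cdf((mu - best_y) / max(sigma, 1e-12))

def hill_climb(acq, x0, step=0.05, n_iters=100):
    """Greedy coordinate hill climbing; only finds a *local* maximum of acq."""
    x = np.clip(np.asarray(x0, dtype=float), 0.0, 1.0)
    for _ in range(n_iters):
        improved = False
        for d in range(x.size):
            for delta in (step, -step):
                cand = x.copy()
                cand[d] = np.clip(cand[d] + delta, 0.0, 1.0)
                if acq(cand) > acq(x):
                    x, improved = cand, True
        if not improved:
            step *= 0.5  # refine once no single step helps
    return x

def next_gait(gp, best_y, dim, n_restarts=10, seed=0):
    """Pick the next gait to evaluate; restarts hedge against local maxima."""
    def acq(x):
        mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
        return prob_improvement(mu[0], sigma[0], best_y)
    rng = np.random.default_rng(seed)
    starts = rng.uniform(0.0, 1.0, size=(n_restarts, dim))
    candidates = [hill_climb(acq, s) for s in starts]
    return max(candidates, key=acq)
```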

 

Experiment

The most appreciable point of the experiments is that, because the experimental setup may differ from previous work, the authors did not simply compare their numbers against previously reported figures; instead, they implemented both their method and the previous ones under the same conditions so that the results are comparable. This is the right way to run the experiment, and it makes the comparison convincing.

The new evaluation criterion, smoothness, is a good one: it takes practical performance into account and makes the robot's behaviour more human-like.

In Section 4 they assume the number of gait parameters is k, and they say a typical gait could have thousands of parameters, so why is this number fixed at 15 in the experiments? Fifteen is a long way from "thousands". They should really extend the model toward something like an infinite Gaussian process, where the number of parameters is not given in advance but learned during training. They also never say what the 15 parameters are.

For the other fixed values, such as the mean and variance, I think they should be given their own priors, for example Beta or Gamma distributions, and be learned during training. You cannot simply pick a number and try several values to find the best one; that does not scale, and as soon as the experimental setting changes everything has to be tuned again.

The quoted sentence "For some random gait parameters, we expect the…" is odd. Where does it come from? How correct is it, and how confident were the authors when they quoted it? The "domain expert" is not clear either (perhaps because of my lack of background knowledge).

The smoothness criterion is good, as mentioned before, but the goal "to find a near-optimal gait in as few evaluations as possible" is tricky, because it is hard to decide on a stopping criterion: you do not know whether the next iteration would give a much better gait even though the current one is already satisfactory. The quality of the gait matters, but so does the number of evaluations; minimizing the number of evaluations comes at the cost of gait quality, so there must be a threshold, or stopping rule, that balances the two. How to set this trade-off point is itself a question. The authors do not discuss it, and I suspect they simply assigned a value arbitrarily, something like "when the gait score reaches n, or the number of evaluations reaches k, stop and move on".
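For illustration only, here is one common shape such a stopping rule could take; none of this is from the paper, and evaluate, fit_gp, and propose are hypothetical helpers: stop when the evaluation budget is spent or when the best achievable probability of improvement falls below a tolerance.

```python
# Sketch of a stopping rule of the kind discussed above (not from the paper).
import numpy as np

def optimize_gait(evaluate, fit_gp, propose, dim,
                  n_init=5, max_evals=50, pi_tol=1e-3, seed=0):
    """evaluate, fit_gp, propose are hypothetical helpers supplied by the caller:
    evaluate(x) runs one gait trial, fit_gp(X, y) returns a fitted GP, and
    propose(gp, best_y, dim) returns (next_x, its probability of improvement)."""
    rng = np.random.default_rng(seed)
    X = [rng.uniform(0.0, 1.0, dim) for _ in range(n_init)]   # random initial gaits
    y = [evaluate(x) for x in X]
    while len(y) < max_evals:
        gp = fit_gp(np.array(X), np.array(y))
        x_next, pi_value = propose(gp, max(y), dim)
        if pi_value < pi_tol:
            break                     # little chance the next trial improves much
        X.append(x_next)
        y.append(evaluate(x_next))    # each call is one expensive robot trial
    best = int(np.argmax(y))
    return X[best], y[best]
```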

In Section 5.3 they mention "the independence of this pair of parameters from the result". I am a little confused about which pair of parameters is meant.

Besides, how did they determine the smoothness score?

 

Readability

Robot gait evaluation is a new topic for me, so there are several terms I am not familiar with, "domain expert" for example. Is that a domain or a person? I could not find a satisfying definition on Wikipedia (which says it is a person), so I assumed it is a person, and found it amusing when the authors said the "domain expert was not familiar with our particular experimental setup".

The main idea (the Gaussian process) is explained concisely but precisely, and each piece of terminology is defined, which is good. However, it would have been better if the authors had discussed how they apply the Gaussian process within their model. (I still suspect they did not use the GP in quite the right way.)

 

 

Written by Yvette

November 18, 2010 at 11:25 AM

Posted in My Work


Paper Critique 4


Paper: The Infinite Factorial Hidden Markov Model

Authors: Jurgen Van Gael, Yee Whye Teh, Zoubin Ghahramani

 

Overview

The idea is quite good: the authors propose a factorial hidden Markov model with an infinite number of chains. It is built on the hidden Markov model, and its time-slice Bayesian network consists of a set of parallel Markov chains, each state S_t^(m) depending on S_(t-1)^(m), which together generate a single observed variable Y_t.

This model is a big advance over the plain hidden Markov model, which assumes only one Markov chain and ignores the other latent variables (the other S^(m) chains, as I understand it).

However, this model still makes some strong assumptions:

1. For example, there is only a single observation Y, and in the later experiments the outputs (as well as the states) are binary. In reality we rarely see purely binary observations.

2. The Indian Buffet Process representation of this model is simple because the features can be treated as independent, which works well for this example. In practice, however, the latent variables act on each other. (The coupled HMM is one example of that situation; perhaps a model for it is worth working on.)

3. Besides, the model is limited in that it only works with discrete variables.

 

Soundness

The model is easy to state, but turning it into mathematical expressions is not so easy. There are several points where the authors' statements do not convince me (or maybe it is just my lack of knowledge).

1. The probability a_m of a 0➝1 transition is given a Beta distribution, a_m ~ Beta(α/M, 1). To my understanding, the two parameters of a Beta distribution correspond to the number of 0➝1 events and non-0➝1 events that we have "seen" before, so I am confused by the authors' choice of α/M and 1 (I would have thought it should be M-α/M). Was the explanation omitted because it is supposed to be obvious? (A small simulation of this prior appears at the end of this section.)

2. The distribution over the probability b_m of a 1➝1 transition seems to rest on a strong assumption in the stick-breaking representation: the authors simply state that the b value corresponds to a_m, with parameters γ and δ, but I do not think they can just leave b_m out of the stick-breaking construction.

3. The model is called infinite, and the authors take the infinite limit. However, in Equation (9) the notation M+, denoting the number of Markov chains that switch on at least once between time 0 and T, is confusing. The number of features may grow (as new features are added) and shrink (as unused features are removed) at each iteration, so the effective number of features stays finite; but M+ reads as if it keeps counting every chain that ever switched on at least once, so it seems M+ would be large and keep growing as new chains are added. Right afterwards the authors say "M+ is the effective dimension of our model". Does that mean M+ excludes chains that have become all-zero, even if somewhere in the middle they had a switch such as s(i,m)=0 ➝ s(i+1,m)=1? That would contradict the "at least once" definition.

Also, I did not see anything "infinite" about the features in the experiments. They initialize the S matrix with 10 features, which is not what the Indian Buffet Process prescribes, where the number of initial features (dishes) is Poisson distributed. Maybe I misunderstood the word "infinite"?

Besides, the derivation of Equation (9) is not given in detail. The authors say it can be found on one of their websites, but I looked and found nothing related to the equation.

4. I have a question about the "unused features". Does "unused" mean the feature is 0 at the current time, or 0 at all times? I do not think it is the former; but if it is the latter, how can the authors be sure the feature will not turn on next time (even with very small probability)? They can of course add a new feature and not care whether this "new" feature is a previously removed one or a genuinely new one, but then they throw away the history, which in a temporal model is very important.

5. In Section 2.3 the authors give a concrete representation, which is very helpful for understanding. However, I am a little concerned about the customer's probability of choosing dish m. I think it should follow the Bernoulli distributions stated in Equations (4) and (5), but it seems the authors use the Beta function directly.

6. Figure 3 is somewhat confusing. Does it mean that a(m) determines all of s(1:m-1)? If not, where are the other s(1,…,m-1)? And if "the likelihood cannot depend on θm if s(t,m)=0", what does it depend on instead? Or does it mean s(t,m) is forced to be non-zero? (I do not think all s(t,m) can be 1.)

7. In the inference part the authors introduce an auxiliary slice variable μ, which seems to act as a threshold deciding which features should be resampled, though I am not sure this understanding is correct. If it is, why did the authors bother with a Beta distribution instead of the seemingly more natural Uniform distribution? I also do not see any operation involving μ in the experiments, so it is strange to read "for some of our experiments we use the more flexible Beta distribution". Are those different experiments?

Besides, the authors say they remove all unused features. Again, I wonder whether this means features that are zero for all data points (all time steps), or whether there is some threshold (and if so, what is it?).

And where and how are the initial values of a and b sampled?

8. For the experiments, I think the authors were rather cunning to choose this speech data and to compare their model with iICA, because iICA seems well suited to signals that are not merely binary but carry frequency content (I did not read the other paper, so I am not sure this is true). The proposed model works only with binary outputs, i.e. speaking or not speaking, so I am sceptical about the result. The ground truth contains long pauses, and for the iFHMM the authors chose a larger value of the Gamma parameter to express those long stretches, which is why the white areas in the figure tend to be larger. They say nothing about the corresponding settings for iICA; maybe iICA has no such parameter for the pause length, or maybe it was set badly, which would explain why it tries to fit every data point at every time step and may be overfitting. In the second experiment iICA seems to do better than in the first, since the number of speakers it recognizes is closer to the truth. It is hard to claim the proposed model is better when each experiment was run only once; they should have tried time intervals of 500, 1500, and so on for the first experiment, or 3, 5, and 10 microphones with a fixed time interval, so we could see whether more microphones improve iFHMM or a longer time interval improves iICA.

The authors also mention mutual information. How can the average mutual information be computed if the numbers of columns (numbers of speakers) are not equal?

9. Footnote 2: does it mean the model is stationary, i.e. time-invariant?
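To get a feel for the prior discussed in points 1 and 2, here is a small simulation of the finite-M version as I read it (my own illustration, not the authors' code): a_m ~ Beta(α/M, 1) governs switching on, b_m ~ Beta(γ, δ) governs staying on, and each binary chain evolves with those transition probabilities.

```python
# Illustrative sketch (not the authors' code) of the finite-M iFHMM prior:
# chain m turns on with probability a_m and stays on with probability b_m.
import numpy as np

def sample_finite_ifhmm_prior(M=10, T=100, alpha=2.0, gamma=5.0, delta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.beta(alpha / M, 1.0, size=M)   # P(0 -> 1) per chain; most are tiny
    b = rng.beta(gamma, delta, size=M)     # P(1 -> 1) per chain (self-persistence)
    S = np.zeros((T, M), dtype=int)        # binary state matrix, rows = time steps
    for t in range(1, T):
        p_on = np.where(S[t - 1] == 1, b, a)
        S[t] = rng.random(M) < p_on
    return S, a, b

S, a, b = sample_finite_ifhmm_prior()
print("chains used at least once:", int((S.sum(axis=0) > 0).sum()))
```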

 

Readability

The authors made a typo in the experiment section (Section 4, paragraph 1, last sentence): it should read "…our first experiment … with more microphones THAN speakers, … fewer microphones THAN speakers".

In the text, the rows of the matrix stand for time steps and the columns for features, but in the figure the rows are features and the columns are time slices. It would be better to use one consistent convention. It would also be clearer to say "time slice" rather than "datapoint" for a temporal model.

Written by Yvette

November 9, 2010 at 1:41 AM

Posted in My Work


Paper Critique 3


Paper: Hierarchical Dirichlet Processes

Authors: Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Overview

In this paper the authors propose a hierarchical DP model that addresses the problem of sharing features across groups within grouped data. It can be viewed as a two-level hierarchical counterpart of the earlier LDA-style model, with the difference that it allows relationships between the groups.

The model is presented through three analogous representations: stick-breaking, the Pólya urn (which can also be viewed as the Chinese restaurant franchise), and the limit of finite mixture models. For sampling, the authors likewise propose three methods: a straightforward Gibbs sampler based on the Chinese restaurant franchise, an alternative sampler based on an augmented representation that also involves the Chinese restaurant franchise, and an improved version of the second with direct assignment, which does not require integrating over the global "corpus-level" measure G0.
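For my own understanding, here is a rough truncated sketch of the stick-breaking representation (my code, not the paper's): a global weight vector β ~ GEM(γ) is shared by all groups, and each group's weights π_j ~ DP(α0, β) are built by re-drawing atoms from β, so every group reuses the same global atoms with different weights.

```python
# Rough truncated stick-breaking sketch of the HDP (illustration only).
import numpy as np

def gem(concentration, K, rng):
    """Truncated GEM stick-breaking weights of length K (last stick closes)."""
    v = rng.beta(1.0, concentration, size=K)
    v[-1] = 1.0
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

def hdp_group_weights(beta, alpha0, K, rng):
    """pi_j ~ DP(alpha0, beta) via Sethuraman's construction (truncated)."""
    w = gem(alpha0, K, rng)                        # stick weights for this group
    atoms = rng.choice(len(beta), size=K, p=beta)  # atoms drawn from the shared beta
    pi = np.zeros(len(beta))
    np.add.at(pi, atoms, w)                        # atoms repeat: topics are shared
    return pi

rng = np.random.default_rng(0)
K = 50
beta = gem(concentration=5.0, K=K, rng=rng)        # global topic weights
groups = [hdp_group_weights(beta, alpha0=1.0, K=K, rng=rng) for _ in range(3)]
# Every group reuses the same global atoms, just with different weights.
```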

Several experiments are presented, each corresponding to a practical model, such as document modelling, which tries to find the topics shared among documents in a corpus, and hidden Markov models.

Contribution

The model is practical because of its ability to link different groups through their shared features/factors. In my own words: the earlier LDA or Dirichlet process models cluster data into groups so that we see the obvious differences between them, whereas this model not only separates the data into clusters by their dissimilarities but also links the clusters by their similarities (not identical, but partially similar).

The proposed model can be applied in many fields, especially areas such as bioinformatics, speech recognition, computer vision, and robotics that involve clustering with shared groups. The paper gives several examples that can serve as useful references for research in these areas.

Another advantage is that the model can easily be extended to hierarchies with more than two levels.

Soundness

The model is well demonstrated and, technically speaking, both the modelling and the sampling are thoroughly worked out. The problem is modelled from different points of view that all arrive at the same result, which is implicit evidence of soundness. The experiments also demonstrate feasibility, consistency, and appropriate modelling.

Notes & Critiques

While reading the paper I jotted down some notes and questions.

The authors keep saying that a draw from a DP is discrete in nature, yet when discussing the analysis of densities (AnDe) the resulting G0 is continuous. I have not had time to check what AnDe actually looks like, but from the sentence here it feels like a contradiction.

Because of the discrete nature of the DP, it is unsuitable for general problems; but if we make it continuous, it can no longer serve the goal of clustering across groups. There is a trade-off here, and one solution (this model) is to keep G0 discrete while its base distribution H can be either continuous or discrete, so the model can still be applied to continuous problems.

The authors also note that H can itself be a draw from a DP in models with more hierarchy levels; here, however, they let H be conjugate to F, which simplifies the problem (I had been wondering what exactly H should be: learned from the process, or an intuitively chosen distribution?).

There is another model by Müller that is very similar, the difference lying in how the weights of the atoms associated with the global distribution are assigned across groups: the former uses equal weights in every group, while the latter allows different weights, which makes much more sense for real problems.

I was quite confused about the GEM distribution. Comparing equation (10) with equation (11), it looks much like a Dirichlet distribution; but if so, why did the authors not use the same notation? I suspect GEM is something like a "uniform" Dirichlet distribution taken to infinitely many components, and if I have time I will read the paper by Griffiths.
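For reference, the stick-breaking definition of GEM(γ) as I understand it (my own summary, not quoted from the paper):

```latex
% beta ~ GEM(gamma) via stick-breaking:
\beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad
\beta_k = \beta'_k \prod_{l=1}^{k-1} \bigl(1 - \beta'_l\bigr), \qquad k = 1, 2, \dots
```

Unlike a finite Dirichlet vector, the β_k are defined for every k and still sum to one, which presumably is why a separate name and notation are used.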

The symbol π has different meanings in the different descriptions of the problem. The π′ in equation (5) plays the same role as the β′_k in equation (20): it denotes the mixing proportions in the infinite limit of the finite mixture model, while also standing for the weights associated with the atoms of G. I had thought it should be interpreted as the parameter of the Dirichlet distribution over "topics".

The random measure G in equation (6) is a draw from a DP, and the π associated with it gives the weights of its atoms, which are also the stick-breaking ratios.

The authors' construction of the Chinese restaurant franchise is quite intricate, as it introduces several index variables such as k and t, which are easy to mix up. On the other hand, the authors then exploit these integer counting variables in a clever sampling scheme.

One question I keep thinking about: this model largely solves the problem of "sharing topics", but in many cases topics are themselves hierarchical, for example the topic "football" is a member of the topic "sports". If we build multiple hierarchy levels with this model, would the topics themselves become hierarchical? It seems not. I have been reading Blei's other paper, Hierarchical Topic Models and the Nested Chinese Restaurant Process, which is based on a similar Chinese restaurant scheme and is fairly simple, and I wonder whether that idea could be plugged into the hierarchical DP model to gain this ability.

Readability

What I like most is that the authors express the model through different representations, from different perspectives. Arriving at the same model each time gives the impression that it is well reasoned and conforms to what we perceive in reality; otherwise there would be contradictions when viewed from another angle.

The use of different methods gave me a more complete understanding of the model, and the authors state each method in a concise (perhaps too brief in places, such as the stick-breaking construction) but precise way that is really clear for readers.

The examples given in this paper are also good for understanding.

In the Chinese restaurant franchise section, the notation is rather easy to mix up.

Written by Yvette

November 2, 2010 at 8:24 AM

Posted in My Work


Paper Critique 2


Paper: The Infinite Gaussian Mixture Model

Author: Carl Edward Rasmussen

(I was reading the book Gaussian Processes for Machine Learning, written by the same author, to better understand this paper. Extra reading also included his tutorial Advances in Gaussian Processes at NIPS 2006.)

 

Overview

In this paper the author proposes a model that overcomes the usual Bayesian mixture model's inability to determine the proper number of mixture components.

The author introduces stochastic indicator variables to "replace" the job of the explicit mixing proportions π. The model is derived by first presenting a finite model with k components, to give readers the concept of the Gaussian mixture model, and then letting k go to infinity to obtain the generalized infinite model.

The author implements the model using Markov chain Monte Carlo and obtains good performance on a 3-D dataset.

 

Contribution

The contributions mentioned by the author are summarized and analyzed below:

1. The most impressive advantage is its ability to handle practical problems where the number of classes is not limited in advance.

2. The number of represented classes is learnt during inference: in the computation the author carefully considers three situations, so that all classes, including unrepresented ones, are assigned parameters.

3. Previous methods for finding the maximum likelihood mostly rely on EM, which is slow and easily gets stuck at local maxima (one thing I do not quite follow: the target seems to be a maximum, not a minimum, yet the author says MCMC avoids local minima).

4. With a finite model, the number of classes is often unknown, which causes difficulty in practical problems; with the infinite model there is no need to know it, because the quantity cancels out in the infinite limit.

 

Soundness

The derivation of the component parameters (Section 2.1 of the paper) proceeds by choosing a conjugate prior matching the form of the likelihood; the conjugate prior, multiplied by the likelihood, then yields a posterior of the same family. Using this process the author writes down the prior and posterior distributions for each parameter. The detailed algebra is not shown, but at a glance the distributions appear technically correct. (I only checked how to obtain equations (4) and (5), and only at the level of a general derivation.)

The choice of distributions is not arbitrary, and the author makes it in a strict and well-reasoned way. For example, in equation (6) the component precisions are given Gamma priors, and in equation (7) the parameters β and ω are given inverse-Gamma and Gamma forms. To my knowledge, the Gamma distribution is the conjugate prior for the inverse of a normal variance (and for the mean parameter of a Poisson distribution); since s_j is the component precision (the inverse of a variance) and ω⁻¹ plays the role of a mean, the choice of Gamma makes sense. The inverse-Gamma is the conjugate prior for a normal variance, so it is natural to give β an inverse-Gamma distribution, because β describes the "shape", in other words the spread, of the other parameter s_j. The choice of Dirichlet in equation (10) is also reasonable.

Therefore the proof is technically sound.
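As a quick illustration of the kind of conjugate update involved (my own sketch, not the paper's derivation): with a Normal likelihood of known mean, a Gamma(shape, rate) prior on the precision yields a Gamma posterior whose parameters are simple count-and-sum updates.

```python
# Minimal conjugacy sketch (not the paper's derivation): a Gamma prior on the
# precision s of a Normal likelihood with known mean mu gives a Gamma posterior.
import numpy as np

def precision_posterior(x, mu, shape0, rate0):
    """Posterior over precision s, with prior s ~ Gamma(shape0, rate0)
    and likelihood x_i ~ Normal(mu, 1/s)."""
    n = len(x)
    shape_n = shape0 + n / 2.0
    rate_n = rate0 + 0.5 * np.sum((np.asarray(x) - mu) ** 2)
    return shape_n, rate_n            # Gamma(shape_n, rate_n), rate parametrization

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)               # true precision = 4
shape_n, rate_n = precision_posterior(x, mu=2.0, shape0=1.0, rate0=1.0)
print("posterior mean of precision:", shape_n / rate_n)    # should be close to 4
```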

The author also seems to be very careful:

1. When deciding on the sampling scheme, he ran some experiments before settling on Monte Carlo as the approach.

2. He takes three possibilities into account when computing the conditional posterior class probabilities, lists them, and gives a complete treatment that explicitly includes the unrepresented classes. As a result the model not only updates existing classes, but can also add new classes and remove empty ones.

3. He also gives the generalization to multivariate observations. Although it is only a short paragraph, it points the way for other researchers.

 

Notes & Critiques

Below are some notes and questions I made.

"Classes" in this paper seem to correspond to the different "functions" that represent smooth distribution curves, i.e. the mixture components.

The most successful part is the indicator variable c, which records each observation's class. The author does not explain why this new variable is introduced, but its purpose becomes clear later in the extension to the infinite model: the mixing proportions can be integrated out (that is what I meant by "replace" in the Overview) and only the indicators remain. Although the number of mixing proportions becomes infinite, the number of distinct class indicators stays finite, so inference can be done with a conditional prior that depends only on the Dirichlet parameter and the observation counts.
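Here is a small sketch of that conditional prior as I understand it (not copied from the paper): an existing class j is chosen with probability proportional to its occupation number, and a brand-new class with probability proportional to the Dirichlet parameter α.

```python
# Sketch of the conditional prior over class indicators in the infinite limit
# (my understanding, not the paper's code).
import numpy as np

def indicator_conditional_prior(c, i, alpha):
    """P(c_i = j | c_{-i}, alpha) for each represented class j, plus a 'new' class."""
    others = np.delete(np.asarray(c), i)
    n_total = len(others)
    classes, counts = np.unique(others, return_counts=True)
    p_existing = counts / (n_total + alpha)     # n_{-i,j} / (n - 1 + alpha)
    p_new = alpha / (n_total + alpha)           # alpha    / (n - 1 + alpha)
    return dict(zip(classes.tolist(), p_existing.tolist())), p_new

c = [0, 0, 1, 2, 2, 2]
probs, p_new = indicator_conditional_prior(c, i=5, alpha=1.0)
print(probs, p_new)   # {0: 2/6, 1: 1/6, 2: 2/6} and 1/6 for a new class
```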

ȳ_j in Equation (4) is a sufficient statistic: the component mean μ_j depends on the mean of the observations assigned to class j, and n_j s_j + r acts as a scaling factor in the denominator.

A new variable called the occupation number, n_j, is introduced in equation (4); it is the number of observations belonging to class j. Recall that the indicator variable c encodes which class generated each observation, so n_j and c are linked through the Kronecker delta: n_j = Σ_i δ(c_i, j).

It seems to me that the density below equation (9) has no conjugate form because it involves an inverse-Gamma (I could not find a conjugate prior for the inverse-Gamma). The author is careful not to leave us with a form that is difficult to sample from: he transforms it to the log domain and suggests using Adaptive Rejection Sampling (which I am still reading about).

I could hardly understand the example in Figure 2. For instance, why is the sum of the covariance coefficients the correlation length? Maybe I need to read more material on Monte Carlo methods.

 

Readability

This seems to be a conference paper: it is very short and gives neither the details of the derivations nor the reasons behind the choices. The target readers seem to be experts with a good background in Gaussian processes and conjugacy for the normal distribution, so the mathematics has scared some people away (I know someone who switched to critiquing another paper after reading this one).

Luckily I had been reading his book, so I got some background from it. Although that is not enough to understand the whole paper, from his work I get the sense that the author is a great researcher who tries to do things in a strict and precise way. His conclusion is very concise but impressive, especially the summary of his contributions: good points with no wasted words.

Written by Yvette

October 24, 2010 at 9:31 PM

Paper Critique 1


Paper: Using Bayesian Networks to Analyze Expression Data
Authors: Friedman, Linial, Nachman and Pe’er

Overview

This is a biological research paper that presents a new method for finding causal relations and interactions between genes. It works from multiple expression measurements within a framework based on Bayesian networks: a multinomial model for discretized data and a linear Gaussian model for continuous data. To analyze the expression data, the authors focus on two classes of features: order relations and Markov blanket relations. They use the two models to learn these relations and draw reasonable conclusions.

Previous Work, Comparison, and Contribution

There is a lot of related work on analyzing biological data, and the authors mention it throughout to make comparisons. Most previous work clusters the data into groups and tries to build a structure that explains the data. The authors' work is quite different: they attempt to learn the internal structure of the data rather than to discover clusters. It is as if the authors use a model to find new, interesting things, while previous researchers were constructing the most suitable theory (or model) for things already known.
Applying the approach yields some interesting findings. The order relation returns a list of dominant genes, which the authors interpret as causal ancestors of the others. These dominant genes had previously been shown to be essential to cell function, so the finding is meaningful and consistent with prior knowledge.
The Markov relations return something exciting. The authors explain that such a relation indicates that the two genes are involved in some joint biological process. One finding is that for some pairs with high confidence scores no link appears in the expression graph, even though previous work showed correlation between these pairs; the authors state that this actually reveals conditional independence, and that it can be explained by biological knowledge.
To sum up, the main contribution is a new method for learning different kinds of relationships among genes, some previously undiscovered or unnoticed, without any prior knowledge.

Soundness and Suggestion/Critique

There are several reasons for choosing Bayesian networks. Gene expression is probabilistic by nature, and Bayesian networks are probabilistic models, well suited to handling noise and estimating confidence in biological processes.
The authors also mention that Bayesian networks describe local interactions and direct dependencies among small numbers of components. Since gene interactions are sparse, with strong relationships within relatively small groups, Bayesian networks are well suited to analyzing their expression patterns. In addition, Bayesian networks can express causal influence, which better describes gene data; for example, a parent can be interpreted as a cause.
Two representations of the conditional distributions are offered: the multinomial model and the linear Gaussian model. The former deals with discrete variables and the latter with continuous variables.
The choice of a Gaussian distribution is reasonable because the variables are continuous and real-valued, so a normal distribution is a natural fit. Here the authors assume the dependence is linear: each child is normally distributed around a linear function of its parents' values, with a variance independent of those values (a small sketch of such a CPD appears after this block of comments). However, the authors do not explain why the representation should be linear in this way.
Furthermore, the authors chose a multinomial distribution for the discrete variables, but a better alternative might be a Dirichlet distribution, since the Dirichlet carries a confidence parameter and can be seen as a multinomial equipped with a measure of confidence. Later in the paper we see that they adopt an extra procedure to estimate confidence values, and then spend a lot of effort compensating for the side effects of the bootstrap. Had they used a Dirichlet, things might have been easier and the results more accurate.
The authors seem to focus mainly on causal influence, because finding correlation between genes is easy, whereas interpreting their relationship as cause and effect takes more effort. Intervention experiments are not yet feasible, so the authors can do nothing more than observe, which by itself may not correctly infer cause-effect relations. The indirect route taken here is to learn an equivalence class, such as a PDAG, and partially recover the direction of arcs between genes. This is done with the help of the features mentioned above, especially the order relation, which tries to determine whether there is a directed edge between two genes. This is somewhat tricky and incomplete in finding all cause-effect relations; but from a positive perspective it is also an advantage of the approach, because it can learn some causal relations without any intervention, which purely observational methods normally cannot.
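To make the linear Gaussian CPD mentioned above concrete, here is a minimal sketch with invented names (my own illustration, not the paper's code): a child gene's expression is normally distributed around a linear combination of its parents' expression values, with a variance that does not depend on those values.

```python
# Minimal sketch of a linear Gaussian CPD (illustrative only, invented names):
# child | parents ~ Normal(a0 + sum_i a_i * parent_i, sigma^2)
import numpy as np

class LinearGaussianCPD:
    def __init__(self, a0, weights, sigma):
        self.a0 = a0                          # intercept
        self.weights = np.asarray(weights)    # one coefficient per parent gene
        self.sigma = sigma                    # spread, independent of parent values

    def mean(self, parent_values):
        return self.a0 + self.weights @ np.asarray(parent_values)

    def sample(self, parent_values, rng):
        return rng.normal(self.mean(parent_values), self.sigma)

# e.g. expression of a hypothetical gene regulated by two parent genes
cpd = LinearGaussianCPD(a0=0.1, weights=[0.8, -0.3], sigma=0.2)
rng = np.random.default_rng(0)
print(cpd.sample(parent_values=[1.2, 0.4], rng=rng))
```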

Evaluation and Critique

The theoretical model is put into practice through two experiments, one for the multinomial model and one for the linear Gaussian model, each consisting of several tests. The data come from another group, and the dataset is relatively small (perhaps not small for biological research, but to a computer science student 800 genes is a small number). To get credible results, the authors put a lot of effort into increasing the reliability of the analysis.
For example, a bootstrap method is used to estimate confidence and guard against false positives, but it can itself introduce artifacts. So the authors created randomized datasets to test the credibility of the confidence assessment; the test returned the expected random results, giving them evidence that the confidence assessment is safe.
There are several assumptions and approximations. The first is that each measurement is treated as an independent sample from a distribution. Since gene interactions are sparse and, according to biological knowledge, independence holds locally, this assumption is quite reasonable.
Second, temporal aspects are not modelled. Instead, the authors introduce a cell-cycle phase variable and place it at the root of the expression model. This compensation seems to be made mainly to reduce computational complexity; the authors clearly noticed it and point it out as a future improvement.
Besides, since different threshold levels used to discretize the expression values can produce different discrete expression patterns, the authors chose the threshold by repeated testing. This is also a weak point, because they then had to normalize the data so as not to miss small variations in some genes. The normalization is applied per gene, giving every gene the same mean and variance, and on the normalized dataset the discretization factor is rescaled according to each gene's variance. One side effect can easily affect the Markov relation results, because that feature detects direct edges between genes and is therefore very sensitive to local changes. Another side effect is that the normalization amplifies measurement noise, which, however, is not a big problem in this experiment since the authors focus only on genes that display significant changes.
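Below is a small sketch of the kind of per-gene normalization and three-level discretization described here, with an assumed threshold; it is my reading of the procedure, not the paper's exact code.

```python
# Illustrative sketch (assumed details, not the paper's exact procedure):
# normalize each gene to zero mean / unit variance, then discretize into
# under-expressed (-1), baseline (0), over-expressed (+1) by a threshold.
import numpy as np

def discretize_expression(X, threshold=0.5):
    """X has shape (n_conditions, n_genes); threshold is in standard deviations."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    Z = (X - mu) / sd                      # per-gene normalization
    D = np.zeros_like(Z, dtype=int)
    D[Z > threshold] = 1                   # significantly over-expressed
    D[Z < -threshold] = -1                 # significantly under-expressed
    return D

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 800))            # pretend 100 conditions x 800 genes
D = discretize_expression(X)
print(np.unique(D, return_counts=True))
```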

Readability

As mentioned at the beginning, this is more of a biology paper, so it does not include many mathematical formulas. Background knowledge is introduced, such as what a Bayesian network is and how equivalence classes of graphs are defined, but the treatment is abstract and not detailed. The whole implementation is described only at a high level, with no concrete procedures derived from the theory. Perhaps this is because the intended readers are non-technical, so the authors did not want to go into such detail.
One good thing is that comparisons with previous work are made throughout, so readers can easily see the contribution and advantages of the new approach.

Written by Yvette

October 12, 2010 at 12:06 AM

Posted in My Work


What Impact Web 2.0 Has on Informatics


This is the essay from my TUM Informatics application. It is of no use now, so I am posting it here. No strong evidence, no logic, just bullshit. Although many problems may arise, I do hope the era described here comes, like the world of Ghost in the Shell.

Still, bullshit.

All rights reserved by YU Wenye.


What Impact Web 2.0 Has on Informatics

In the era of Web 2.0, obtaining information has become easier and more flexible. As new technologies are developed for better collaboration and interaction, the way we browse websites is changing, and the impact of Web 2.0 can be viewed from three aspects: the user-oriented Internet, software as online services, and the virtual community.

1.    User Oriented Internet

From the users' point of view, with the spread of the Internet and the growing number of users, the Internet is becoming more organic as people become dynamically involved in building websites. One difference between Web 1.0 and Web 2.0 is that Web 1.0 builds content-centred websites as information sources, while Web 2.0, as a participatory platform, aims at a user-content relationship that emphasizes interaction between users and websites, giving users dynamic access to information according to their specific needs. Users no longer wait for experts to publish information on certain websites; instead, they become the main players, able to create and customize their own profiles to express their ideas.

Wikipedia is a successful example. This online encyclopedia has become so popular because of its huge, free database. The data source is not limited to a few portals but distributed across individuals, as everyone is invited to join the process of creation and editing; the user becomes both author and reader. On the one hand, letting users publish content brings more information from various sources into the database; on the other hand, allowing users to modify website content through comments and other feedback systems lets information be sorted and reorganized according to users' preferences. Search engines rank retrieved results by users' interests. Differing interests do not make the Internet a disordered place; in fact they tend to select and keep information that is useful to users and eliminate data that is meaningless. Tagging is another technology that lets users do the classification work, so information is categorized at the top level by logic and common sense.

All the techniques mentioned above help develop a personalized Internet, a platform on which users can carry out their particular activities.

2.    Software as Online Service

From the companies' perspective, web services and online applications are the future trend. Software is no longer bought in retail stores or isolated on individual PCs, but provided as web services in a distributed environment so that data and information can be shared. Take Google as an example: most of Google's web applications are free, and users only need a login account to use them. Google never sells its products, but asks clients to pay for extra services.

The ultimate goal of any product is user-oriented: to provide a comfortable user experience, and there are several reasons why online services can replace stand-alone applications. First, there is no installation and no platform-incompatibility problem; the browser presents the web services to the user, offering not just simple browsing but all kinds of functions that ordinary software can achieve. Second, data and information are shared across the Internet, reducing duplication; for example, Google requires only one account, universal for all its applications. Third, older design patterns are not suitable for today's online applications, since the software life cycle never ends and what is needed now is continuous testing and improvement once the application is published. Most online services are open source, so users can take part in the application's development; feedback is returned instantly and updates can be applied in time to one central server.

Nowadays almost every common piece of stand-alone software has an online substitute, and with the rise of cloud computing it is believed that software as online services will dominate the market in the near future; perhaps an online operating system will even replace current desktop applications.

3.    Virtual Community

Looking at the network as a whole, the Internet is becoming more and more like a virtual community. Its initial objective was to build a platform for sharing and collaboration, and the introduction of new techniques such as AJAX gives more freedom in communication. Information is decentralized while the structure is integrated. The Internet has now evolved to a stage where it not only offers direct interaction between users, but also supports a realistic virtual world.

User-to-user interaction gives rise to community and social-networking websites. Users with similar tastes congregate to form communities where thoughts and opinions are exchanged. Social network sites are therefore booming, making the virtual community resemble everyday society. Web 2.0 also calls for real identities on social networking websites: a user is not forced to provide his real name and other information on Facebook, but if he registers with a fake name his friends may not find him, and he loses the chance to participate in the related social circle.

Online applications and web services can accomplish many tasks just as in the real world. The famous PayPal, also considered a Web 3.0 example, gives users a real commercial experience; other companies such as Amazon also provide convenient virtual business to users.

There are still concerns about security and authority; however, with the development of hardware and communication technology, all kinds of tasks will be conducted online and the Internet will come together as a virtual community that behaves no differently from the real world.

The core feature of Web 2.0 is its focus on people rather than content. Based on users' requirements, new techniques are introduced and new structures are developed. However, Web 2.0 is not the end goal: sophisticated algorithms may be applied for intelligent interaction between users and applications, and databases and information retrieval remain important since applications are driven by information. As mentioned above, the concept of Web 3.0 has already been coined, along with a profitable business model, and a new era is coming.

Written by Yvette

May 31, 2010 at 11:45 PM

Posted in My Work


Linking Theory with Practice


Back when I studied data structures, we were told that bubble sort is good enough when sorting very small datasets. But I searched online for the big-O of all the sorting algorithms (because I had forgotten the rankings from back then), and for fewer than 5000 elements, shell sort was supposedly the fastest.
However! In my own hands-on test, shell sort could not even get past IE's JavaScript debugging, because it hogged the CPU (or was too slow)! Or maybe IE is just too cheap?
Bubble sort, on the other hand, finished sorting in no time.
I was speechless. My array size was 17. It seems bubble sort really is well suited to arrays this short.

Linking theory with practice: that big-O ranking is only a rough guide. As the saying goes, to believe everything in books is worse than having no books at all.
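For what it is worth, here is a rough timing sketch, in Python rather than the original IE JavaScript, of how one could compare the two algorithms on a 17-element array; the numbers will of course differ from the IE experience described above.

```python
# Rough timing sketch (Python, not the original IE JavaScript):
# bubble sort vs. shell sort on a 17-element array.
import random
import timeit

def bubble_sort(a):
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        for j in range(i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def shell_sort(a):
    a = list(a)
    gap = len(a) // 2
    while gap > 0:
        for i in range(gap, len(a)):
            tmp, j = a[i], i
            while j >= gap and a[j - gap] > tmp:
                a[j] = a[j - gap]
                j -= gap
            a[j] = tmp
        gap //= 2
    return a

data = [random.randint(0, 100) for _ in range(17)]
assert bubble_sort(data) == sorted(data) == shell_sort(data)
for name, fn in [("bubble", bubble_sort), ("shell", shell_sort)]:
    t = timeit.timeit(lambda: fn(data), number=10000)
    print(f"{name} sort: {t:.4f}s for 10000 runs")
```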

Written by Yvette

February 6, 2009 at 12:12 AM

Posted in Tech Problems
