Jan 28, 2009

Understanding Git Concepts

Git is, in essence, a file system with history.

All data is stored in git objects. Every git object has a 40-character (160-bit) id, generated by SHA-1 hashing the object's content. There are 4 types of objects:
  • blob object: File contents are stored here. No filenames, permissions, etc.; only the contents themselves.
  • tree object: Directory structure is stored here. A tree object's content is just a list of its children, each either a blob or another tree. A list item contains a SHA-1 hash pointing to a blob or tree object, together with a filename, permissions, etc. With this we have a data structure (a tree) that can represent a file system.
  • commit object: Now we need history. A commit object simply contains a pointer to a tree, one or more pointers to parents (also commits), and some bookkeeping data such as the committer. Commit objects thus form a graph (a DAG) on a layer above blobs and trees.
  • tag object: A tag object exists to reference another object conveniently. It holds a pointer to any other git object plus a tag name, so you can use the tag to reference any object (like an important commit) in your repo.
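The id scheme is easy to check outside git: an object id is the SHA-1 of a small header ("<type> <byte length>\0") followed by the content. A minimal Ruby sketch for blobs, which reproduces what `git hash-object --stdin` prints:

```ruby
require 'digest/sha1'

# A git blob id is SHA-1("blob <content byte length>\0<content>").
def git_blob_id(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

puts git_blob_id("test content\n")
# => d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Because the id is derived purely from content, two identical files anywhere in the repo are stored as one blob.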
A git object is immutable. Another concept in git is the reference, used for mutable things like branches and remotes.
  • A branch is just a file in the .git/refs/heads/ directory containing the SHA-1 hash of the most recent commit on that branch. When you create a branch, git simply creates a file holding a 40-byte hash in .git/refs/heads/ and updates .git/HEAD to point to it. As your development moves on, git finds the current branch via HEAD and updates the corresponding file in refs/heads.
  • A remote is a pointer to a branch (so it is itself a branch) in another copy of the same repo. If you got the code by cloning instead of 'git init', git automatically adds a default 'origin/master' remote branch for you: 'origin' points to the location of the remote copy, and 'master' is the branch on the remote you cloned from.
When you ask for a checkout, git looks up the argument you provided in .git/refs or .git/HEAD, finds the corresponding object/branch/tag, reads from its content the SHA-1 hash that points to a tree, and then traverses that tree.

A pull will fetch the updates on a remote branch and merge them into your local branch. By default it merges in changes from origin/master, but you can pull updates from other places like origin/cool. After a series of fetches and merges your history graph will look like a mess; rebase will help. Rebase leaves orphan objects in your repo (you can use 'git gc' to clean them up) and should not be used on a repo that others may fetch from.

Jan 25, 2009

Evolutionary algorithm example in Ruby

#!/usr/bin/ruby -w

# This program is based on this evolutionary computation introduction:
# http://blog.uncommons.org/2009/01/20/practical-evolutionary-computation-an-introduction/
# usage: ruby evolution.rb [generations] [goal]
# example: ruby evolution.rb
#          ruby evolution.rb 100
#          ruby evolution.rb 100 "I make the universe"

# We can set a goal (string) for the evolution, and the Evolution object
# will evolve towards the goal you set.
class Evolution
  attr_accessor :set

  # The goal (string) can contain only uppercase characters and spaces.
  # CHARSET is in fact the gene pool.
  CHARSET = ('A'..'Z').to_a + [' ']
  CHARSET_LENGTH = CHARSET.length

  # goal:          the evolution goal, a string like 'hello world'
  # population:    the population of the society. Defaults to 100, meaning
  #                there are 100 parents in the initial environment, and
  #                the environment can only support 100 individuals.
  # mutation_rate: the probability of gene mutation. Defaults to 0.01,
  #                meaning a gene 'A' in one generation has a 1/100 chance
  #                of mutating to a random gene in CHARSET.
  def initialize(goal, population = 100, mutation_rate = 0.01)
    @goal = goal
    @population = population
    @mutation_rate = mutation_rate

    # @set is the environment all individuals live in
    @set = []
    @strlen = goal.length

    # fill the environment with random individuals
    population.times do
      str = ""
      @strlen.times { str << CHARSET[rand(CHARSET_LENGTH)] }
      @set << str
    end
  end

  # evolution entry point
  # reproduce: how many generations the evolution should run
  def run(reproduce = 1000)
    reproduce.times { generation }
    sort_and_cut(@set).each { |s| puts "#{s} : #{score s}" }
  end

  # one generation: score everyone, select parents, breed, then mutate
  def generation
    score_set
    pick
    crossover
    mutation
  end

  # give every individual in the environment a chance to mutate
  def mutation
    k = (1 / @mutation_rate).round   # rand(k) == 0 with probability @mutation_rate
    for str in @set
      str.length.times do |i|
        str[i] = CHARSET[rand(CHARSET_LENGTH)] if rand(k) == 0
      end
    end
  end

  # choose pairs of parents and produce offspring by cutting both
  # parents at a random pivot and swapping the tails
  def crossover
    set = @set.uniq
    offsprings = []
    offsprings = @set if set.length == 1
    (set.length - 1).times do |i|
      (i + 1).upto(set.length - 1) do |j|
        pivot = rand(@strlen) + 1
        par1_a, par1_b = set[i][0, pivot], set[i][pivot, @strlen]
        par2_a, par2_b = set[j][0, pivot], set[j][pivot, @strlen]
        offsprings << "#{par1_a}#{par2_b}"
        offsprings << "#{par2_a}#{par1_b}"
      end
    end
    @set = sort_and_cut(offsprings)
  end

  # pick the good candidates (high score): a score-2 candidate is more
  # likely to be chosen than a score-1 candidate
  def pick
    pool = []
    @score_map.each do |str, score|
      # every candidate gets at least one ticket, so the pool is never
      # empty even when all scores are 0
      (score + 1).times { pool << str }
    end
    pool.shuffle!
    pool_len = pool.length

    @set = []
    @population.times { @set << pool[rand(pool_len)] }
  end

  # compute the score of every candidate
  def score_set
    @score_map = {}
    for str in @set
      @score_map[str] = score(str)
    end
  end

  # score tells us the similarity between str and the goal:
  # the number of characters matching the goal at the same position
  def score(str)
    score = 0
    @strlen.times { |i| score += 1 if str[i] == @goal[i] }
    score
  end

  # sort individuals by score; only those with the highest scores
  # survive. This is natural selection.
  def sort_and_cut(set)
    set.sort_by { |s| -(score s) }[0, @population]
  end
end

if __FILE__ == $0
  times = ARGV[0] ? ARGV[0].to_i : 20
  goal = ARGV[1] ? ARGV[1].upcase : 'HELLO WORLD'
  e = Evolution.new(goal)
  e.run times
end

Jan 14, 2009

Haskell Functor Typeclass

A correct functor instance should follow two rules:

fmap id == id
fmap (f . g) == fmap f . fmap g

In plain words: a functor should preserve the data's structure, changing only its values. These laws can't be guaranteed by the compiler, so we have to remember them ourselves when implementing a functor instance.
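Since the compiler can't check the laws, it helps to test them by hand. A small Ruby analogue, with Array#map playing the role of fmap for the list functor:

```ruby
# For the list functor, fmap is Array#map.
f = ->(x) { x + 1 }
g = ->(x) { x * 2 }
xs = [1, 2, 3]

# Law 1: fmap id == id
raise unless xs.map { |x| x } == xs

# Law 2: fmap (f . g) == fmap f . fmap g
# (f . g) applies g first, then f, so the right-hand side maps g, then f.
f_after_g = ->(x) { f.(g.(x)) }
raise unless xs.map(&f_after_g) == xs.map(&g).map(&f)
```

A lawless "instance" here would be a map that also reorders or drops elements; it changes the structure, not just the values.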

Jan 4, 2009

Functional interpreter and program modularity

The purpose of the evolution of programming techniques is modularity. To get modularity, people need abstractions. What is the difference between an abstraction and a typing-saving sugar? A good package of common patterns (an abstraction) encapsulates a higher-level concept which has meaning *independent* of its implementation.

A package is most useful if its behavior is independent of the context (environment/global resources/etc.) of its use. Such a package is called referentially transparent. In other words, the output (behavior) of a referentially transparent package (function) will always be the same given the same input, because the explicitly provided input is the only thing it depends on. (What 'same' means here is an interesting problem; we'll see later.)

To make a modular system, it is often necessary to think of a computational process as having state. In such cases, if the state can be naturally divided into independent parts, an important decomposition may be the division of the program into pieces which separately deal with each part of the state.

Referential transparency permits programs to be divided into parts so that each part can be separately specified without a description of its implementation. The desirable result is that pieces can be separately written and debugged. At first people made a recursive interpreter with no free variables[1], with separate variable bindings (the environment) and procedure bindings. Its expressive power is very limited.

Separate environment and procedure symbol tables make procedures a second-class concept (thus you can't define a 'map' function in the interpreter), but merging these two tables unintentionally brings in two properties: free variables (in fact, before the merge, procedure symbols were free variables, but they weren't real 'variables' at that time) and dynamically scoped variables.

To avoid function name conflicts, it would be nice to have a notation for functions as objects, or rather a way to write an s-expression in code that evaluates to a procedure. Lisp adapted such a notation from the lambda calculus of Alonzo Church.

But lambda plus dynamic scoping leads to the famous 'FUNARG' problem, so we need lexical scoping. What we want is this: when a lambda is applied, use the environment in which it was evaluated (defined) instead of the environment in which it is executed. The solution is simply to save the environment in the procedure object when evaluating a lambda (or function; they're the same thing, a function is just a lambda with a name in the symbol table). With this change, we say that the procedure is closed in the current environment, and the procedure object is therefore called a *closure* of the procedure, or a *closed procedure*.
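Ruby's lambdas are closed procedures in exactly this sense; a small sketch of the defining environment being saved in the procedure object:

```ruby
# make_counter's local variable `count` is captured in the closure:
# each returned lambda carries the environment it was created in.
def make_counter
  count = 0
  -> { count += 1 }
end

c1 = make_counter
c2 = make_counter
c1.call  # => 1
c1.call  # => 2
c2.call  # => 1  (c2 closed over its own, separate environment)
```

Under dynamic scoping, `count` would instead be looked up in whatever environment happened to exist at call time, which is precisely the FUNARG problem.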

The problem with lexical scope shows up in the REPL: a new definition can only refer to previously defined names! We lose the ability to define recursive procedures. This conflict between the REPL and lexical scope is unavoidable, because such an incremental, interactive top-level loop for reading definitions inherently violates the referential transparency we successfully achieved in our interpreter. A piece of code can be read in which refers to an as-yet-undefined identifier (the name of a procedure), and a definition for that identifier can be read in later (thereby altering the meaning of the reference). If we insist on maintaining absolute referential transparency, we are forced to eliminate incremental top-level interaction, to give up interactive debugging (we can't easily redefine erroneous procedures), and to give up incremental compilation of separate modules.

If we throw lexical scoping away and turn back to dynamic scoping we would lose a great deal of referential transparency and abstractive power. The solution can be a mixture: procedures must not be allowed to refer to variables internal to other procedures, but only to top-level variables existing at the time they are called. Therefore only the future top-level environment is to be included in the procedure object when it is eventually constructed. In this way free variable references will be dynamic only with respect to the top-level environment.

At this stage, we have made our functions truly referentially transparent, with no side effects. No side effects means no state; no state means you have to pass state up and down (as function inputs and outputs) through the whole system. So the absence of side effects conflicts with modular discipline. We are forced to introduce side effects as a technique for constructing modular systems. But side effects violate referential transparency, and now our two techniques for achieving modularity have come into direct conflict.

The concept of side effect is induced by particular choices of boundaries between parts of a larger system. If a system boundary encloses all processes of interest (the system is closed), we need no concept of side effect to describe that system as a whole in vacuo. If we wish to make an abstraction by dividing the system into modules more than one of which has independent state, then we have by this action created the concept of side effect.

The concept of side effect is inseparable from the notion of equality/identity/sameness. The only way one can observationally determine that a side effect has occurred is when the same object behaves in two different ways at different times. Conversely, the only way one can determine that two objects are the same is to perform a side effect on one and look for an appropriate change in the behavior of the other.

If CONS returns a new object on every call, then it has a side effect! With the same input, it generates different output (a totally new object) each time.

If side effects are to be usable at all, references to things denoted by variables must not make copies of those things. If the user is to be able to write procedures which produce lasting side effects on their arguments, then there must be a variable-binding mechanism which does not make copies.

The ideal equality predicate should follow these two rules:

1). Two objects which are observed to behave differently must not be equal.
2). Conversely, we would like two objects which are adjudged unequal to exhibit differing behaviors under suitable circumstances.

Any useful equality predicate must satisfy 1), but it's hard to satisfy 2). (Another interesting view: an equality predicate should never give false positives, but may give false negatives.)

Based on the above two rules: in the absence of RPLACA ("pure lisp"), EQUAL is preferred to EQ (like (==) in Haskell); in the presence of side effects such as RPLACA, EQ is preferred to EQUAL.
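Ruby carries the same pair of predicates: == is structural equality (like EQUAL) and equal? is object identity (like EQ). A sketch of how a side effect makes the difference observable:

```ruby
a = "foo"
b = "foo"

a == b       # => true   (structurally equal, like EQUAL)
a.equal?(b)  # => false  (two distinct objects, like EQ)

# A side effect is the only way to observe that they are not the
# same object: mutating a leaves b unaffected.
a << "!"
a            # => "foo!"
b            # => "foo"
```

In a language without mutation the distinction is unobservable, which is exactly why EQUAL suffices for "pure lisp".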

Finally, we find that the set-use-reset pattern is very helpful for modularity, and dynamic scoping captures this pattern well. So we want both dynamically and lexically scoped variables in our interpreter. We need to maintain separate environments for lexical and dynamic variables in the interpreter to avoid certain problems. This requires a special syntax for distinguishing references to, and bindings of, the two kinds of variables.

Dynamic scoping provides an important abstraction for dealing with side effects in a controlled way. A low-level procedure may have state variables which are of no interest to intermediate routines, but which must be controlled at a high level. Dynamic scoping allows any procedure to access parts of the state when necessary, but permits most procedures to ignore the existence of the state variables. Having many dynamic variables permits the decomposition of the state in such a way that only the part of interest need be dealt with.
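The set-use-reset pattern can be sketched in Ruby with a block and ensure (the $log_level variable here is hypothetical, purely for illustration):

```ruby
# Dynamic-extent binding: set a value, run the block, and always
# reset the old value afterwards, even if the block raises.
$log_level = :info

def with_log_level(level)
  old, $log_level = $log_level, level
  yield
ensure
  $log_level = old
end

with_log_level(:debug) do
  # any procedure called from here sees :debug without being passed it
  puts $log_level   # prints "debug"
end
puts $log_level      # prints "info"
```

Intermediate callers never mention $log_level at all; only the high-level caller that sets it and the low-level code that reads it know it exists.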

[1] LISP was not originally derived from Church's lambda calculus. In early LISP people used the "McCarthy conditional" to define recursion:

factorial[x] = [x=0 -> 1; T -> x*factorial[x-1]]

while in recursive function theory one would define it like this:

factorial(0) = 1
factorial(successor(x)) = successor(x) * factorial(x)

Haskell adopts the latter notation, while keeping the first one too (the case expression).