Archive for the ‘Computer Science’ Category

git’s versus svn’s storage efficiency

Tuesday, July 8th, 2008

At Codeyard we maintain a git and a subversion repository (which are synced with each other) for each of the >115 projects. The following graph shows the repositories plotted logarithmically according to the size of their whole server side subversion repository horizontally and their git repository size vertically:

To make more sense of the logarithmic nature of the graph, I’ve added three lines. The first (solid black) indicates the points of which both sizes are equal. The second course dashed line indicates the points of which the subversion repository is twice as large as the git repository. And lastly, the third finely dashed line indicates the points of which the subversion repository is five times as large as the git repository.

All projects for which git is less storage efficient, are smaller than 100Kb. The projects for which git is most storage efficient (up to even 6 times for a certain C# project), are all of medium size (10–100MB) and code-heavy. For the other projects, which are blob heavy (eg. images), git and subversion are close (git beats svn by ~20%).

One notable disadvantage of huge (someone committed a livecd image) git repositories, is an apparent \geq2N memory usage of git repack even if I tell it not to with --window-memory.

ssmtp segfault in sendmail on Gentoo

Tuesday, February 26th, 2008

Disabling the md5sum useflag might fix it.

Thanks to Bram for the tip.

Evolving the Object Paradigm

Thursday, January 31st, 2008

Kaja is writing a series of articles on the shortcomings and solutions to the current object paradigm. Very interesting.

Linux 2.6.24-bw-r34

Sunday, January 20th, 2008

Again an update for my bw-tree. There isn’t a tree that includes reiser4 and TuxOnIce without a lot of other bloat, so I created one myself.

Download the big diff or the seperate patches broken out.

Note I pulled in new patches for reiser4 from the -mm tree against -rc8 which should fix the Reiser4 flush problem a bit.

Update Patched against 2.6.24 stable. New TuxOnIce patched added and the genpatches. Please note that I haven’t tested Reiser4 thoroughly enough on this version.

linux 2.6.24-bw-r12

Thursday, December 20th, 2007

There isn’t a (stable-ish) tree that includes reiser4 and TuxOnIce (suspend2), so I made one myself, based on 2.6.24-rc5-rc67.

You can grab it as one big patch, bw-r12-for-2.6.24-rc5-rc67.diff.bz2, or broken out: bw-r12-for-2.6.24-rc5-rc67.tar.bz2.

rebuild-scm-ebuilds

Saturday, December 8th, 2007

Two very simple ruby scripts (I love ruby!) and a little bash script that searches all ebuilds with 9999 in the name, orders them by dependencies and then emerge -a them. Very useful when you got a lot of 9999 ebuilds.

CaCert.org

Saturday, December 8th, 2007

CaCert is a Certification Authority that works with a web of trust: people meet and assure (similar to keysigning) eachother. If you’ve been assured by enough people you’ll be able to let your ssl server key be certified by cacert. It’s a lot more secure than other CA’s who just give anyone a certificate who pays enough.

Still a hierarchical system with a CA is flawed. When the CA is compromised, the whole system fails. PGP’s web of trust hasn’t got this weakness.

(Got a nice shiny cacert certified ssl certificate on my webserver now)

Ulrich Drepper on Memory

Saturday, October 13th, 2007

Kernel hacker Ulrich Drepper has written a lengthy paper on the design and especially performance characteristics of memory of modern consumer hardware and it’s written for programmers. A great read so far (it’s published in parts by LWN):

What every programmer should know about memory by Ulrich Drepper.

Stupid PHP (1) (Strings are faster than Arrays)

Sunday, September 23rd, 2007

When I slowly build a big string out of little bits, the worst thing to do in most languages is to just use string concatenation:

for(something) {
 str += little_bit;
}

Why? Everytime a little bit is added to str, there must be a new string allocated big enough to contain str and the new little bit. Then the actual str must be copied in. In most languages there are constructs to efficiently build strings like this instead of concatenating. StringBuffer in C#. StringIO in Python.

But no, PHP has to be stupid. There is no nice construct and you’ll end up using concatenation. So, I thought to be smart and make use of PHP array’s and implode. Arrays are here for having elements added and removed all the time so they are properly buffered and should be great at having lots of small elements added. And when I want to pack it all into one big string, I can use PHP’s builtin implode function.

I wanted to try it out and created two scripts: a.php concats a little (10byte) string one million times and b.php appends it to an array and then implodes it. And because I’m also interested in the performance of implode I got a script c.php that’s identical to b.php but doesn’t implode afterwards. These are the results:

a.php (concat) 0.320s
b.php (array append and implode) 0.814s
c.php (array append) 0.732s

Indeed, string concatenation with all its allocation and copying is actually faster than plain simple array appending. PHP is stupid.

Stupid IE (2)

Sunday, September 2nd, 2007

To create a multiple select-box with javascript you need a very ugly hack in IE.

if (navigator.appName.match(/Internet Explorer/)) {
	fsel = document.createElement(\'<SELECT MULTIPLE>\');
} else {
	fsel = document.createElement(\'select\');
	fsel.multiple = true;
}

Stupid IE (1)

Sunday, September 2nd, 2007
if(!Array.prototype.indexOf) {
	Array.prototype.indexOf = function(el) {
		var i = 0;
		for(; i < this.length; i++)
			if(this[i] == el)
				return i;
		return -1;
	};
}

The (or at least a better) Solution to Spam

Tuesday, August 14th, 2007

There is no easy way to distinguish between a human and a spambot. It’s an arms race which we’ll always be behind. I’m talking here about spam in more general—not only on e-mail but also on for instance Wikipedia or on blogposts. Even if we would have a perfect-solution to test whether there is a human behind something, we still have to combat cheap labour in India: real people spamming “by hand”.

I think the solution is a Web of Trust similar to that of PGP. An identity (not necessarily a person) publishes who she trusts (not to be spammy/phishy) or not trusts. Ideally everyone would be in one big web. Only someone who my blog via-via trusts may post.

Obviously one still may gather trust of people over time and enter the web of trust and then start spamming with that identity. However, then that identity will be marked untrusted by people and also the people who initially marked the identity as trusted will be less trusted. Also, there are way more sophisticated measures of establishment in the web/trust to conceive than just being trusted via-via by one identity.

There is no way to prevent spam perfectly, but the amount of work that has to go in to making an identity trusted and established in the web is several orders of magnitude greater than any other protection we have. The big problem: we don’t have such an ubiquitous web of trust yet. (Yes, it’ll be in SINP if I’ll get around to working on it)

bas@fsfe.org

Monday, August 6th, 2007

I just got back from Wacken Open Air 2007 (it was great!) and noticed that my bas@fsfe.org account has been activated :).

linux 2.6.22-bw-r1

Thursday, July 12th, 2007

There isn’t a tree that contains reiser4, suspend2 and the gentoo patches — so I created one. This revision adds the -stable patches and the new gentoo patches.

One big patch: bw-r1-for-2.6.22.diff.bz2
Patches broken out: bw-r1-for-2.6.22.tar.bz2

To apply the one big patch, use: bzcat bw-r1-for-2.6.22.diff.bz2 | patch -p1 inside a vanilla 2.6.22.

Online Cycle Detection in Directed Graphs

Monday, July 2nd, 2007

A short while ago I came across a quite interesting problem. Design a datastructure (and algorithms) to maintain a Directed Acyclic Graph.

There has to be only one operation: adding a link between two given nodes. This operation must be able to detect and deny any new link that would cause a cycle. For simplicity, nodes are identified by sequential ids starting with 0.

An example:

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 2, 0 -> False # Fails because it would create a cycle

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 0, 2 -> True # This isn’t a real cycle, so it’s perfectly fine

It’s rather trivial to create a \mathcal{O}\left(n+\ell\right) (where n is the number of nodes and \ell the number of links). I conjecture there exists a \mathcal{O}\left(\log n\right) algorithm.

Syslets and other untrusted kernelspace programming

Monday, June 11th, 2007

In the wake of creating true async. IO in the linux kernel Ingo worked on so called syslets. A syslet is nothing more than one async system call to chain together a few arbitrary system calls. An example syslet would be:

1. Read from file descriptor 123
2. If 1 returned with error, break
3. Write which was read to file descriptor 321
4. If 3 succeeded jump to 1 otherwise break

This is represented by a few struct’s linked together of which the first is passed on to the kernel with the syslet system call. The systemcall returns practically directly and eventually the application can be notified (or can wait) on the syslet to complete. This could safe an enormous amount of system calls. This means way less context switches, which is very good for performance.

Syslets dawn, in a very primitive manner, kernelspace scripting. But hey, wasn’t kernelside scripting the thing that all linux dev’s dreaded? Wasn’t it Linus who joked that he would be in a mental institute whilst releasing Linux 3 with a VB message pump? Yes, they’re afraid of putting untrusted programs/scripts in kernelspace and they’ll barely acknowledge that syslets is the first step.

The problem with the current full-featured scripting languages is that they are, well, full-features gone wrong: they’re bloated and not really secure. In kernelspace you can’t allow any memory except for the scripts’ own to be accessed, not to mention the restrictions on resources and virtual memory won’t help you there. Most scripting languages weren’t developed with these restrictions in mind. Most languages have got evil functions in dark corners of the standard library that will allow you to do really evil stuff with memory.

As far as I know only .net (thus mono) have got a quite rigorous trust framework build in. .Net is bloated and proprietary and Mono is still (and probably will never be) feature complete though still being very bloated.

A very simple safe language is what we need, of which a compiler daemon is running as a system service, with which untrusted userspace programs can have scripts running in the kernel. I’m tempted to use one of the brainf*ck JIT’s, they’re small enough to thoroughly review :).

A kernelspace interpreter would do to, though, as a PoC.

Linux 2.6.21-bw-r1

Sunday, June 10th, 2007

There isn’t a tree that contains reiser4, suspend2 and the gentoo patches — so I created one. This revision adds the -stable patches, new gentoo patches and some powertop patches.

One big patch: bw-r1-for-2.6.21.diff.bz2
Patches broken out: bw-r1-for-2.6.21.tar.bz2

To apply the one big patch, use: bzcat bw-r1-for-2.6.21.diff.bz2 | patch -p1 inside a vanilla 2.6.21.

Upgrading wordpress with git

Thursday, June 7th, 2007

I didn’t like upgrading wodpress much. Everytime I did it, I needed to re-apply all my little tweaks to the new wordpress. It took too much time.

I tried to diff -uNr on the current version I was running and the newer version and then applying the resulting diff to the current version, but it seems wordpress has been backporting changes so I got conflicts, quite a lot of them.

Because I was quite tired of porting my changes, I’ve tried git, the Source Code Managment tool used by the linux kernel, to do it for me:

I did this all in the parent directory of the root of blog.w-nz.com. This folder contains:

  • htdocs current installation (2.1.2)
  • 2.1.2 the unmodified wordpress
  • 2.2.0 the new wordpress I want to upgrade to

First, I created an empty git repository:

mkdir git; cd git; git init-db; cd ..

Then I copied over the unmodified version of wordpress I was running, and commited them:

cp 2.1.2/* git -R
cd git
git add *
git commit -a -s
cd ..

Then I copied over my current installation:

cp htdocs/* git -R
git status # lets see what changed

There are lots of files like uploads I want git to ignore, so I edit .gitignore to make git ignore them. There weren’t any files I added though, otherwise I’d had to run git add to let git know.

And let commit my changes:

git commit -a -s

Now, lets go back to the original commit — the clean 2.1.2 wordpress — and start a branch from there:

git checkout HEAD^ # HEAD^ means parent commit of HEAD: the previous commit
git checkout -b tmp # create a new branch tmp from here

Now I’m in a branch without my own changes, which was forked from the master branch. Lets apply the new wordpress on this branch:

cd ..
cp 2.2.0/* git -R
cd git
git status # see what changed

git-status showed me that there are a few new files in wordpress 2.2.0, I git-add-ed all of these new files. And then committed it all:

git commit -a -s

Now I’ve got two branches:

  • master which contains wordpress 2.1.2 with my own changes on top as a commit
  • tmp which is forked from the wordpress 2.1.2 from the master branch without my own changes but with the 2.2.0 changes on top

What I want to do is to reapply the 2.2.0 changes on top of my current changes’ commit instead of on top of the 2.1.2 commit. To do this, git has a very powerfull util called git-rebase:

git rebase master

This will search down the tree until the point where the current branch (tmp) forked from the target branch (master). Then it will re-apply all commits in between on the latest commit of the target branch.

Just like if I’d use diff/patch I get a merge conflict. git rebase lets me know this and git status shows me which one are these. The one little difference with the diff/patch approach is, that there are way less merge conflicts (git is smarter) and that the merge conflict are way easier to identify and they’re inline in the original files. Not to mention that when I would have fucked up I’d always have a way back.

After I fixed the merge conflict, I git update-index each conflicted file (to tell git it’s resolved) and git rebase --continue-ed.

Now I’ve got my updated wordpress in the git folder. Then I backuped the current, copied over from git and visited wp-admin/upgrade.php and I’m done :).

By the way: “I didn’t say Subversion doesn’t work. Subversion users are just ugly and stupid.” — Linus on this Google tech talk.

Sidenote, I switched from Hashcash to Akismet. Hashcash didn’t work anymore and Akismet theoretically should be the best solution because it isn’t based on security by obscurity.

Elastic tabstops plugins for gedit

Monday, May 28th, 2007

Spaces vs Tabs? Nah: elastic tabstops, now for gedit.

The Future of the Internet

Sunday, May 6th, 2007

One of the prominent people behind the current internet discusses the history (telephony, wire oriented), the current (IP, endpoint oriented) and the future (?, data oriented) at google tech talks.

A short synopsis: the internet is having trouble at the moment for it has been designed in a time the problem was different. In these days most of the data is duplicate data, which is a tremendous waste. Also connecting to the internet (getting an address) (and resulting from that keeping everything in sync) is hard. Van suggests and predicts a data oriented internet. A bit like a secure P2P bittorrent network, but instead of on top of IP on the IP level.

It’s a very interesting talk, worth watching.