1. 03 Jun 2013

    Several ways to write binary array search (Go)

    On Children's Day I compared the lookup performance of Python's built-in dict against binary array search. dict won by a wide margin, which was somewhat surprising:

    1. dict is O(1) and binary search is O(log n), but with n = 1M, log n is only about 20, and intuitively the constant in front of log n is small, so in theory the gap from O(1) should not be large
    2. the Python language itself was suspect: the binary array search was hand-written in Python, while dict is implemented in C, and Python is much slower than C

    So on June 2nd I redid the comparison in Go, which was also a chance to get familiar with the language. Binary search can be written in several ways:

    1. recursive: the plain recursive version
    2. 3-branch: iterative, two comparisons (== plus < or >) and three branches per iteration; it returns as soon as the key is found, so the best case is O(1)
    3. 2-branch: iterative, one comparison (< or >) and two branches per iteration; even the best case is O(log n)
    4. library: Go's standard library (sort.Search), a generalized version of 3; you supply a function that tells it which way to go

    2-branch

    The idea behind 3 is interesting: it first computes the index at which the key would sit if inserted into the already-sorted array. If the key occurs in the array (possibly more than once), that index is the first occurrence. Finally it checks whether the element at that index really is the key being searched for.

    func binary_search2(arr []int, key int) int {
            lo, hi := 0, len(arr)-1
            for lo < hi {
                    mid := (lo + hi) / 2
                    if arr[mid] < key {
                            lo = mid + 1
                    } else {
                            hi = mid
                    }
            }
    
            if lo == hi && arr[lo] == key {
                    return lo
            } else {
                    return -(lo + 1)
            }
    }
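
    To make the "not found" encoding concrete, here is a tiny usage sketch of my own (not code from the post); it assumes the binary_search2 above and an imported fmt package:

    func example_search2() {
            arr := []int{1, 3, 5, 7}
            fmt.Println(binary_search2(arr, 5)) // 2: key found at index 2
            fmt.Println(binary_search2(arr, 4)) // -3 = -(2+1): not found, insertion point is 2
    }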

    library

    Approach 4 is quite fun; the function it exposes is:

    func Search(n int, f func(int) bool) int

    Go has no templates, no Comparator like Java's, no operator overloading, and it is statically typed. Yet this interface needs none of those and is extremely general (at a small performance cost).
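
    As a minimal sketch of how a plain []int lookup can be expressed on top of it (this wrapper is my own illustration, assuming the sort package is imported; it is not the post's benchmark code):

    func binary_search_lib(arr []int, key int) int {
            // sort.Search returns the smallest i in [0, len(arr)] for which
            // the predicate is true, i.e. the insertion point of key
            i := sort.Search(len(arr), func(i int) bool { return arr[i] >= key })
            if i < len(arr) && arr[i] == key {
                    return i
            }
            return -(i + 1) // same "not found" encoding as the versions above
    }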

    3-branch and recursive

    1 and 2 are the conventional solutions.

    // recursive
    func binary_search0(arr []int, key int) int {
            return binary_search_(arr, 0, len(arr)-1, key) // end index is inclusive
    }
    
    func binary_search_(arr []int, begin, end, key int) int {
            if begin > end {
                    return -1
            }
            mid := (begin + end) / 2
            if arr[mid] == key {
                    return mid
            } else if arr[mid] > key {
                    return binary_search_(arr, begin, mid-1, key)
            } else {
                    return binary_search_(arr, mid+1, end, key)
            }
    }
    
    // 3-branch
    func binary_search(arr []int, key int) int {
            lo, hi := 0, len(arr)-1
            for lo <= hi {
                    mid := (lo + hi) / 2
                    if arr[mid] == key {
                            return mid
                    } else if arr[mid] < key { // search upper subarray
                            lo = mid + 1
                    } else {
                            hi = mid - 1
                    }
            }
            return -(lo + 1)
    }

    Performance comparison

    I randomly generated 2k to 500k ints, ran 1,500 lookups each time, and recorded the elapsed time, using Go's built-in map as a baseline.
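
    The measurement loop can be sketched roughly like this (my own illustration, not the post's actual benchmark code, which is linked below; it assumes the search functions above plus the fmt, math/rand, sort and time packages):

    func bench(n int) {
            // build n random ints, both as a sorted slice and as a map
            arr := make([]int, n)
            m := make(map[int]bool, n)
            for i := range arr {
                    v := rand.Int()
                    arr[i] = v
                    m[v] = true
            }
            sort.Ints(arr)

            // 1500 lookup keys: half guaranteed hits, half random
            keys := make([]int, 1500)
            for i := range keys {
                    if i%2 == 0 {
                            keys[i] = arr[rand.Intn(n)]
                    } else {
                            keys[i] = rand.Int()
                    }
            }

            start := time.Now()
            for _, k := range keys {
                    binary_search2(arr, k)
            }
            fmt.Printf("n=%d  2-branch: %v\n", n, time.Since(start))

            start = time.Now()
            found := 0
            for _, k := range keys {
                    if m[k] {
                            found++
                    }
            }
            fmt.Printf("n=%d  map: %v (%d hits)\n", n, time.Since(start), found)
    }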

    (performance chart)

    1. map is still the fastest, but it doesn't crush the alternatives the way Python's dict did
    2. the Go standard library version is the most general, and also the slowest
    3. as the number of items grows, 2-branch gains a clear advantage over 3-branch: it may run more iterations (there is no early return on a hit), but each iteration costs 1 comparison and 2 branches instead of 2 comparisons and 3 branches. The JDK's java.util.Arrays.binarySearch uses the 3-branch version, which may win when the array is small (under 2k), or perhaps there are other considerations

    Recursion performs acceptably, and it is easy to write and hard to get wrong. Nice.

    The code, which also plots the results as a nicer-looking bar chart.

    PS: this weekend, apart from hanging out with friends and buying toys for my girlfriend's nephew, the rest went into fiddling with binary search.

  2. 01 Jun 2013

    Python: binary search vs. dict

    Today is International Children's Day. It no longer has anything to do with me. I wish all the kids a happy, joyful childhood.

    A dict can be used to test whether an element belongs to a set, and so can binary search. I've long wanted to compare their performance:

    1. for lookup, hashing is O(1) and binary search is O(log n)
    2. an array is very compact in memory, so it may need fewer memory accesses, and it also uses less memory
    3. I've written a few programs where the array data comes from mmap: another program sorts it and writes it to a file
    4. some languages, such as C, have no dict (map) in their standard library
    5. sizes range from 2k to 200k items, with 1,500 lookups: 750 keys are guaranteed hits, the other 750 may or may not be present

    Test code

    The test code is here. The binary search implementation:

    def binary_search(arr, key):
        lo, hi = 0, (len(arr) - 1)
        while lo <= hi:
            mid = (lo + hi) >> 1
            k = arr[mid]
            if k == key:
                return mid
            elif k < key:
                lo = mid + 1 # search upper subarray
            else:
                hi = mid - 1 # search lower subarray
        # not found: return -(insertion point) - 1
        return -(lo + 1)

    Code to randomly generate the test data:

    COUNT = 1000 * 200
    
    # key is string
    key_str = {}
    for i in range(COUNT):
        key_str[random_string(random.randint(2, 15))] = True
    
    # key is int
    key_int = {}
    for i in range(COUNT):
        key_int[random.randint(0, 1 << 32)] = True

    Random strings of length 2 to 15:

    (chart)

    Random ints, with the array being a Python list:

    (chart)

    Random ints, using the array module:

    (chart)

    Conclusions

    1. dict beats binary search hands down, even with only 2,000 items
    2. as the number of items grows, dict performance stays nearly constant: O(1) is no joke
    3. surprisingly, the array module is slower than list. The docs say it is fast, but binary search does random access
    4. binary search being slow may simply be Python being slow; dict is implemented in C and heavily optimized
    5. there is no obvious performance difference between int and string keys
    6. I should test other languages, such as Go, Java, and C, when I find the time
  3. 19 Mar 2013

    A brief discussion of how to efficiently execute some thread-pool tasks in order

    A thread pool accepts tasks and executes them.

    T1
    T2 *
    T3
    T4
    T5 *
    T6

    Without special handling, a thread pool cannot guarantee the execution order of T1-T6; they may run concurrently. But what if there is a special requirement, say T2 and T5 must execute in order? How should that be implemented?

    The problem came up while writing http-kit, when I added WebSocket protocol support. With WebSocket, a client may send multiple messages to the server, and the server needs to preserve their order: first come, first processed; a later message must wait until the earlier ones have been handled.

    http-kit's threading model is a single asynchronous IO thread that accepts client connections, reads and parses the protocol into a request, and hands it to a thread pool, which computes the response; the IO thread then writes the response into the client's TCP socket buffer. http-kit handles both HTTP and WebSocket. HTTP is a single request => response, so without pipelining this problem does not arise (http-kit chose not to implement HTTP pipelining).

    So the problem becomes: how to efficiently execute some thread-pool tasks in order.

    Simple and efficient code is always appealing; I'm even willing to trade a little efficiency for simplicity. In about 3,000 lines of code, http-kit implements an efficient HTTP server from scratch, with HTTP long-polling push and WebSocket, an asynchronous HTTP client, timers, and a pleasant API. Experience tells me that simplicity also buys performance: http-kit has performance characteristics similar to Nginx's; here is a benchmark done by a friend in South Africa, and here a 600k-concurrent-connection (C600K) experiment that made the top of the Hacker News front page.

    My first thought was to work on the thread pool itself: implement a special pool in which multiple threads share one global queue and each thread also has its own private queue, with the private queue guaranteeing order. With two queues, you have to deal with tasks being starved. I tried writing a few implementations, but none felt right; aware of my limits, I sent an email to a mailing list full of experts in this area, including Doug Lea, the famous author of j.u.c, hoping for some pointers. The next day several people replied, and Jeff Hain's reply was an eye-opener:

    It seems to me you don’t need a particular thread pool to have some of your tasks ordered, if you can keep a reference to last submitted runnable for each ordered sequence of runnables (for example using a map in IO thread).

    I’m thinking about using a specific “LinkingRunnable”, containing an “impl” Runnable and an AtomicReference to a “next” Runnable.

    When LinkingRunnable.run() is called, it first runs impl.run(), and then does CAS(null,this) (using “this” as a tombstone): if CAS fails, that means someone added a runnable to run next, and it runs it.

    When subsequent task arrives, you try to “enqueue” it doing CAS(null,next): If CAS succeeds, that means it’ll be executed just after previous task (and in same thread), and if it fails that means that the previous task already completed, so you can just submit your next task to the pool.

    You can also test “get() == null” before doing CASes, which should make them rare, and reduce the usual overhead to a volatile read.

    I implemented it immediately:

    class LinkingRunnable implements Runnable {
        private final Runnable impl;
        AtomicReference<LinkingRunnable> next = new AtomicReference<LinkingRunnable>(null);
    
        public LinkingRunnable(Runnable r) {
            this.impl = r;
        }
    
        public void run() {
            impl.run();
            if (!next.compareAndSet(null, this)) { // CAS failed: a next task was linked; run it
                next.get().run();
            }
        }
    }
    
    WSFrameHandler task = new WSFrameHandler(channel, frame);
    LinkingRunnable job = new LinkingRunnable(task);
    // channel.serialTask holds the last submitted task for this channel: read the old one, then update it
    LinkingRunnable old = channel.serialTask;
    channel.serialTask = job;
    
    if (old == null) { // No previous job
        execs.submit(job);
    } else {
        if (old.next.compareAndSet(null, job)) {
            // successfully appended to the previous task; it will run this job
        } else {
            // previous message is handled, order is guaranteed.
            execs.submit(job);
        }
    }

    More detailed code, along with its surrounding context, can be found at the link.

    This achieves partial ordering simply and efficiently. I like it a lot; it meets my standard for good code: simple and efficient.

  4. 10 Mar 2013

    Unified API for WebSocket, HTTP streaming and long-polling (WIP)

    Peter and I have been working on a new unified API for WebSocket, HTTP streaming and long-polling, built around the protocol Channel and the macro with-channel.

    The new API is fully working now. I just got the time to write a page explaining it a little: http://http-kit.org/channel.html.

    It has a few benefits:

    • unified, simpler
    • more features: HTTP streaming is supported
    • the server also gets notified when the client closes or goes away, for long-polling/streaming

    If you're interested, please give it a try and tell me what you think.

  5. 28 Jan 2013

    600k concurrent HTTP connections, with Clojure & http-kit

    Original: http://http-kit.org/600k-concurrent-connection-http-kit.html

    Edit (2013/1/29): http-kit is an event-driven HTTP server & client for Clojure. It's open source: https://github.com/http-kit

    Inspired by Scaling node.js to 100k concurrent connections! and Node.js w/250k concurrent connections!, I ran a similar test for http-kit.

    http-kit manages to sustain 600k concurrent connections, on a PC!

    Server’s logic

    The server reads the length param from the request and returns a string of that length.

    ;; main.clj
    ;; ~20k string
    (def const-str (apply str (repeat 200 "http-kit is a http server & client
       written from scrach for high performance clojure web applications,
       support async and websocket")))
    
    (defn handler [req]
      (let [length (to-int (or (-> req :params :length) 1024))]
        {:status 200
         :headers {"Content-Type" "text/plain"}
         :body (subs const-str 0 (max (min 10240 length) 1))}))
    
    (defn -main [& args]
      (run-server (-> handler wrap-keyword-params wrap-params)
                  {:port 8000})
      (println (str "Server started. listen at 0.0.0.0")))

    Start the server:

    java -server -Xms3072m -Xmx3072m -cp `lein classpath` clojure.main -m main

    Linux config

    The server needs the maximum number of allowed open files raised to a much larger value; the default is ~1024.

    echo 9999999 | sudo tee /proc/sys/fs/nr_open
    echo 9999999 | sudo tee /proc/sys/fs/file-max
    
    # edit /etc/security/limits.conf, add the following line, need logout and login again
    * - nofile 4999999
    
    # set before run the server and test code
    ulimit -n 4999999

    More ports for the test code to use

    sudo sysctl -w net.ipv4.ip_local_port_range="1025 65535"

    Hardware & Software

    The server and test code are both run on my desktop:

    • CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz, 4 core, 8 threads
    • RAM: 16G @ 1333MHZ
    • OS : Linux 3.2.0-2-amd64 #1 SMP Sun Apr 15 16:47:38 UTC 2012 x86_64 GNU/Linux
    • http-kit: “2.0-rc1”
    • JVM: 1.7.0_04

    How to make 600k concurrent connections on a single PC

    A single IP can issue at most 65536 connections to a server, since the socket port is an unsigned short. But we can bypass this limit: on Linux it's quite easy to set up virtual network interfaces:

    for i in `seq 200 230`; do sudo ifconfig eth0:$i 192.168.1.$i up ; done

    Your computer then has many IPs, from 192.168.1.200 to 192.168.1.230. The server binds to 0.0.0.0, and the client can connect to 192.168.1.200, 192.168.1.201, etc. Each IP allows about 60k concurrent connections, so the client can issue as many concurrent connections as it needs.

    Concurrency test code

    The client opens 600k concurrent keep-alive connections to the server, asks it to return a string of a random length between 1 and 4096 bytes, reads the response, idles for 5s to 45s (a randomly picked value), then requests again.

    output:

    time 0s, concurrency: 100, total requests: 0, thoughput: 0.00M/s, 0.00 requests/seconds
    time 40s, concurrency: 164000, total requests: 230142, thoughput: 11.78M/s, 5688.28 requests/seconds
    ...
    time 89s, concurrency: 340100, total requests: 788985, thoughput: 18.23M/s, 8812.23 requests/seconds
    ...
    time 179s, concurrency: 595166, total requests: 2483174, thoughput: 28.61M/s, 13837.77 requests/seconds
    time 180s, concurrency: 597853, total requests: 2506378, thoughput: 28.71M/s, 13888.67 requests/seconds
    time 183s, concurrency: 600000, total requests: 2529020, thoughput: 28.52M/s, 13788.14 requests/seconds
    time 185s, concurrency: 600000, total requests: 2537212, thoughput: 28.20M/s, 13680.42 requests/seconds
    ...
    time 930s, concurrency: 600000, total requests: 17457773, thoughput: 38.64M/s, 18763.53 requests/seconds
    time 931s, concurrency: 600000, total requests: 17477678, thoughput: 38.69M/s, 18764.73 requests/seconds

    An ab test while the 600k connections are kept open

    Issue this command from the command line:

    ab -n 100000 -c 10 -k http://127.0.0.1:8000/

    Output:

    Server Software:        http-kit
    Server Hostname:        127.0.0.1
    Server Port:            8000
    
    Document Path:          /
    Document Length:        1024 bytes
    
    Concurrency Level:      10
    Time taken for tests:   3.184 seconds
    Complete requests:      100000
    Failed requests:        0
    Write errors:           0
    Keep-Alive requests:    100000
    Total transferred:      117000000 bytes
    HTML transferred:       102400000 bytes
    Requests per second:    31405.53 [#/sec] (mean)
    Time per request:       0.318 [ms] (mean)
    Time per request:       0.032 [ms] (mean, across all concurrent requests)
    Transfer rate:          35883.27 [Kbytes/sec] received
    
    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.0      0       0
    Processing:     0    0   9.3      0     913
    Waiting:        0    0   9.3      0     913
    Total:          0    0   9.3      0     913
    
    Percentage of the requests served within a certain time (ms)
      50%      0
      66%      0
      75%      0
      80%      0
      90%      0
      95%      0
      98%      0
      99%      0
     100%    913 (longest request)

    The Clojure Server’s CPU usage

    jvisualvm’s snapshot file.

    (CPU usage chart)

    The Clojure Server’s heap usage

    (heap usage chart)

    Run it yourself

    The complete test code is available on GitHub. Check it out and run it yourself!

    To report a bug, or general discussion: https://github.com/http-kit/scale-clojure-web-app/issues