Blog Harvester

Harvesting

Apropos

The Harvester is made of fresh flesh from the blogosphere, clean design by Tigion and hacked-together Ruby scripts by Astro, improved by Neingeist and Josef.

To be fed

Just blogs: ATOM 1.0 or RSS 2.0,

All collections: ATOM 1.0 or RSS 2.0

Be notified

Start Jabber and talk to AstroBot to receive the freshest of all!

Questions

Missing something?
Contact Astro via Mail or Jabber: astro@spaceboyz.net

Hack it…

Old SVN but now on GitHub.

…or learn more

Project info, Repository, Trac, Source browser

Inductive Bias

JAX: Logging best practices

The ideal outcome of Peter Roßbach’s talk on logging best practices was to have
attendees leave the room thinking “we know all this already and are applying
it successfully” - most likely though the majority left thinking about how to
implement even the most basic advice discussed.


From his consultancy and fire fighter background he has a good overview of what
logging in the average corporate environment looks like: No logging plan, no
rules, dozens of logging frameworks in active use, output in many different
languages, no structured log events but a myriad of different quoting,
formatting and bracketing standards instead.


So what should the ideal log line contain? First of all it should really be a
log line instead of a multi line something that cannot be reconstructed when
interleaved with other messages. The line should not only contain the class
name that logged the information (actually that is the least important piece of
information), it should contain the thread id, server name, a (standardised and
always consistently formatted) timestamp in a decent resolution (hint: one new
timestamp per second is not helpful when facing several hundred requests per
second). Make sure to have timing aligned across machines if timestamps are
needed for correlating logs. Ideally there should be context in the form of
request id, flow id, session id.


When thinking about logs, do not think too much about human readability - think
more in terms of machine readability and parsability. Treat your logging system
as the db in your data center that has to deal with most traffic. It is what
holds user interactions and system metrics that can be used as business
metrics, for debugging performance problems, for digging up functional issues.
Most likely you will want to turn free text that provides lots of flexibility
for screwing up into a more structured format like json, or even some binary
format that is storage efficient (think protocol buffers, thrift, avro).
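
A minimal sketch of what such a machine-readable log line could look like,
assuming JSON as the target format. The class name LogLine, the field names
(ts, host, thread, level, requestId, msg) and the sample values are made up
for illustration; a real setup would more likely use the JSON layout of its
logging framework.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration, not part of any logging framework.
final class LogLine {
    // Fixed-width UTC timestamp with millisecond resolution.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

    /** Escapes quotes, backslashes and control characters so the event stays on one line. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c == '\n') {
                sb.append("\\n");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one JSON object per event; field order is fixed to ease parsing. */
    static String format(Instant ts, String host, String thread, String level,
                         String requestId, String message) {
        String[][] fields = {
            {"ts", TS.format(ts)}, {"host", host}, {"thread", thread},
            {"level", level}, {"requestId", requestId}, {"msg", message}
        };
        StringBuilder sb = new StringBuilder("{");
        for (String[] f : fields) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(f[0]).append("\":\"").append(escape(f[1])).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.now(), "web-1", Thread.currentThread().getName(),
                "INFO", "req-42", "checkout finished"));
    }
}
```

Escaping newlines inside the message is what keeps a stack trace or other
multi-line text from being torn apart when lines of several threads interleave.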


In terms of log levels, make sure to log development traces on trace, provide
detailed problem analysis information on debug, and put normal behaviour onto
info. In case of degraded functionality, log to warn. Things you cannot
easily recover from go on error. When it comes to logging hierarchies -
do not only think in class hierarchies but also in terms of use cases: Just
because your http connector is used in two modules doesn’t mean that there
should be no way to turn logging on for just one of those modules.
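
One way to get use-case based hierarchies is to name loggers after the use
case instead of the class. The sketch below uses java.util.logging purely as
an assumption (the talk did not prescribe a framework); the logger names are
invented, and FINEST stands in for trace.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical logger names: the same http connector logs under the use case that calls it.
final class UseCaseLogging {
    // Strong references, because java.util.logging only holds loggers weakly.
    static final Logger CHECKOUT = Logger.getLogger("usecase.checkout");
    static final Logger CHECKOUT_HTTP = Logger.getLogger("usecase.checkout.http");
    static final Logger SEARCH_HTTP = Logger.getLogger("usecase.search.http");

    /** Turns on trace-level logging for everything below the checkout use case only. */
    static void traceCheckoutOnly() {
        CHECKOUT.setLevel(Level.FINEST);
        // Handlers filter as well: the default ConsoleHandler would also have to be
        // lowered to FINEST before these records actually show up in the output.
    }

    public static void main(String[] args) {
        traceCheckoutOnly();
        System.out.println("checkout http traces: " + CHECKOUT_HTTP.isLoggable(Level.FINEST));
        System.out.println("search http traces:   " + SEARCH_HTTP.isLoggable(Level.FINEST));
    }
}
```

The child logger inherits the level of its nearest configured parent, so the
connector code itself does not need to know which module is being debugged.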


When designing your logging make sure to talk to all stakeholders to get clear
requirements. Make sure you can find out how the system is being used in the
wild, be able to quantify the number of exceptions; max, min and average
duration of a request and similar metrics.
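
A small sketch of how such numbers could be collected in-process, assuming
request durations are available in milliseconds. RequestStats and its method
names are hypothetical; in practice these values would typically be exported
to a metrics system rather than kept in memory.

```java
import java.util.LongSummaryStatistics;

// Hypothetical in-process collector; names are illustrative only.
final class RequestStats {
    private final LongSummaryStatistics durationsMs = new LongSummaryStatistics();
    private long exceptions;

    /** Records one finished request and whether it ended in an exception. */
    synchronized void record(long durationMs, boolean failed) {
        durationsMs.accept(durationMs);
        if (failed) {
            exceptions++;
        }
    }

    synchronized long count() { return durationsMs.getCount(); }
    synchronized long minMs() { return durationsMs.getMin(); }
    synchronized long maxMs() { return durationsMs.getMax(); }
    synchronized double averageMs() { return durationsMs.getAverage(); }
    synchronized long exceptionCount() { return exceptions; }
}
```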


Tools you could look at for help include but are not limited to splunk, jmx,
jconsole, syslog, logstash, statsd, redis for log collection and queuing.


As a parting exercise: Look at all of your own logfiles and count the different
formats used for storing time.

Inductive Bias

JAX: Java performance myths

This talk was one of the famous talks on Java performance myths by Arno Haase.
His main point - supported by dozens of illustrative examples - was for
software developers to stop trusting the word of mouth, cargo cult like myths
that are abundant among engineers. Again the goal should be to write readable
code above all: for one, the Java compiler and JIT are great at optimising. In
addition many of the myths spread in the Java community that are claimed
to lead to better performance are simply not true.


It was interesting to learn how many different aspects of both software and
hardware contribute to code performance. Micro benchmarks are considered
dangerous for a reason - creating a well controlled environment that matches
what the code will encounter in production is hard, because results are
influenced by things like just in time compilation, cpu throttling, etc.


Some myths that Arno proved wrong include final making code faster (in case of
method parameters it makes no difference, to the point of the bytecode being
identical with and without), and inheritance always being expensive (even with
an abstract class between the interface and the implementation Java 6 and 7 can
still inline the method in question). Another one was on often wrongly scoped
Java vs. C comparisons. One myth revolved around the creation of temporary
objects - since Java 6 and 7 even these can be optimised away in simple cases.


When it comes to (un-)boxing and reflection there is a performance penalty. For
the latter it is mostly for method lookup, not so much for calling the method.
What we are talking about however are penalties in the range of about 1000
compute cycles, which is dwarfed by the cost of any remote call. Reflection on
fields is even cheaper.


One of the more widespread myths revolved around string concatenation being
expensive - doing a “A” + “B” in code will be turned into “AB” in
bytecode. Even doing the same with a variable will be turned into the use of
StringBuilder ever since -XX:+OptimizeStringConcat was turned on by default.
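
The constant case is easy to check without a benchmark. A small sketch (the
class name ConcatDemo is made up): compile-time constants are folded into one
interned literal, so an identity comparison, used here on purpose, succeeds,
while a concatenation involving a run-time value creates a new object.

```java
// Hypothetical demo class; the == comparisons are deliberate identity checks.
final class ConcatDemo {
    /** Both operands are compile-time constants, so javac stores the single literal "AB". */
    static boolean constantsAreFolded() {
        String folded = "A" + "B";
        return folded == "AB";
    }

    /** With a run-time operand the concatenation happens at run time and yields a new object. */
    static boolean variableIsSameObject(String a) {
        String joined = a + "B";
        return joined == "AB";
    }

    public static void main(String[] args) {
        System.out.println(constantsAreFolded());
        System.out.println(variableIsSameObject("A"));
    }
}
```

Running javap -c on the compiled class shows the folded literal directly in
the bytecode.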


The main message here is to stop trusting your intuition when reasoning about a
system’s performance and performance bottlenecks. Instead the goal should be to
go and measure what is really going on. The myths above are simple examples
where your average Java intuition goes wrong. Make sure to stay on top of what
the JVM turns your code into and how that is then executed on the hardware you
have rolled out if you really want to get the last bit of speed out of your
application.

genius' blog

COO

Chief Of Outlook ;)

Inductive Bias

JAX: Does parallel equal performant?

In general there is a tendency to equate parallel implementations with
performant implementations. Except in the really naive case there is always
going to be some overhead due to scheduling work, managing memory sharing and
network communication. Essentially that knowledge is reflected in
Amdahl’s law (the amount of serial work limits the benefit from running parts
of your implementation in parallel, http://en.wikipedia.org/wiki/Amdahl’s_law),
and Little’s law (http://en.wikipedia.org/wiki/Little’s_law) in case of queuing
problems.
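
Both laws fit in a few lines. The sketch below states them in their textbook
form; the class and method names are invented for illustration. Amdahl: with a
parallelisable fraction p and n workers the speedup is 1 / ((1 - p) + p / n).
Little: the average number of requests in the system equals arrival rate times
average time spent in the system.

```java
// Hypothetical helper class; formulas are the textbook versions of both laws.
final class ScalingLaws {
    /** Amdahl's law: speedup for parallel fraction p (0..1) on n workers. */
    static double amdahlSpeedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    /** Upper bound on the speedup for parallel fraction p, however many workers are added. */
    static double amdahlLimit(double p) {
        return 1.0 / (1.0 - p);
    }

    /** Little's law: average number of requests in flight. */
    static double requestsInFlight(double arrivalsPerSecond, double secondsInSystem) {
        return arrivalsPerSecond * secondsInSystem;
    }

    public static void main(String[] args) {
        System.out.println(amdahlSpeedup(0.9, 10));
        System.out.println(amdahlLimit(0.9));
        System.out.println(requestsInFlight(200, 0.05));
    }
}
```

Even with 90% of the work parallelisable, ten workers give a speedup of only
about 5.3, and no number of workers gets beyond 10.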


When looking at current Java optimisations there is quite a bit going on to
support better parallelisation: Work is being done to improve lock contention
situations, the GC adaptive sizing policy has been improved to a usable state,
and there is added support for parallel arrays and lambda’s splittable
interface.


When it comes to better locking optimisations what is most notable is work
towards coarsening locks at compile and JIT time (essentially moving locks from
the inside of a loop to the outside); eliminating locks if objects are being
used in a local, non-threaded context anyway; and support for biased locking
(that is, enforcing locks only when a second thread is trying to access an
object). All three taken together can lead to performance improvements that
make StringBuffer and StringBuilder exhibit almost equal performance in a
single threaded context.


For pieces of code that suffer from false sharing (two variables used
independently in separate threads that end up in the same CPU cache line and as
a result are both flushed on update) there is a new annotation: Adding the
“@Contended” annotation tells the JVM where to add cache line padding (or to
re-arrange fields entirely) to keep that false sharing from happening. One
other way to avoid false sharing seems to be to look for class cohesion -
coherent classes where methods and variables are closely related tend to suffer
less from false sharing. If you would like to view the resulting layout use the
“-XX:PrintFieldLayout” option.


Java 8 will bring a few more notable improvements including changes to the
adaptive sizing GC policy, the introduction of parallel arrays that allow for
parallel execution of predicates on array entries, changes to the concurrency
libraries, internalised iterators.

Inductive Bias

JAX: Pigs, snakes and deaths by 1k cuts

In his talk on performance problems Rainer Schuppe gave a great introduction to
which kinds of performance problems can be observed in production and how to
best root-cause them.


Simply put, performance issues usually arise due to a difference in either data
volume, concurrency levels or resource usage between the dev, qa and production
environments. The tooling to uncover and explain them is pretty well known:
Starting with looking at logfiles, ARM tools, using aspects, bytecode
instrumentation, sampling, watching JMX statistics, and PMI tools.


All of these tools have their own unique advantages and disadvantages. With
logs you get the most freedom, however you have to know what to log at
development time. In addition logging is i/o heavy, so doing too much can slow
down the application itself. In a common distributed system logs need to be
aggregated somehow. A simple example of what can go wrong are cascading
exceptions spilled to disk that cause machines to run out of disk space one
after the other. When relying on logging make sure to keep transaction
contexts, in particular transaction ids, across machines and services to
correlate outages. In terms of tool support, look at scribe, splunk and flume.


A tool often used for tracking down performance issues in development is the
well known profiler. Usually it creates lots of very detailed data. However it
is most valuable in development - in production, profiling a complete server
stack produces way too much load and data to be feasible. In addition there’s
usually no transaction context available for correlation either.


A third way of watching applications do their work is to watch via JMX. This
capability is built in for any Java application, in particular for servlet
containers. Again there is no transaction context. Unless you take care of it
yourself there won’t be any historic data.
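
A minimal sketch of keeping that history yourself, assuming the platform
MemoryMXBean as the data source. HeapHistory, its capacity parameter and the
idea of calling sample() from a scheduler are illustrative assumptions; real
setups usually ship such samples to a monitoring system instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical ring buffer of heap samples read through the platform MXBean.
final class HeapHistory {
    private final Deque<long[]> samples = new ArrayDeque<>();
    private final int capacity;

    HeapHistory(int capacity) {
        this.capacity = capacity;
    }

    /** Reads the current heap usage via JMX and keeps only the newest samples. */
    synchronized void sample() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        if (samples.size() == capacity) {
            samples.removeFirst(); // drop the oldest entry so the history itself cannot leak
        }
        samples.addLast(new long[] {System.currentTimeMillis(), heap.getUsed()});
    }

    synchronized int size() {
        return samples.size();
    }

    /** Used heap bytes of the most recent sample, or -1 if nothing was sampled yet. */
    synchronized long latestUsedBytes() {
        return samples.isEmpty() ? -1 : samples.peekLast()[1];
    }
}
```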


When it comes to diagnosing problems, you are essentially left with fixing
either the “it does not work” case or the “it is slow” case.


For the “it is slow” case there are a few incarnations:


  • It was always slow, we got used to it.

  • It gets slow over time.

  • It gets slower exponentially.

  • It suddenly gets slow.

  • There is a spontaneous crash.


In the case of “it does not work” you are left with the following observations:


  • Sudden outages.

  • Always flaky.

  • Sporadic error messages.

  • Silent death.

  • Increasing error rates.

  • Misleading error messages.


In the end you will always be spinning in a loop: Look at symptoms, eliminate
non-causes, identify suspects, confirm and eliminate by comparing to normal. If
not done after that: lather, rinse, repeat. When it comes to causes for errors
and slowness you will usually run into one of the following: bad coding
practices, too much load, missing backends, resource conflicts, memory and
resource leakage as well as hardware/networking issues.


Some symptoms you may observe include foreseeable lock ups (it’s always slow
after four hours, so we just reboot automatically before that), consistent
slowness, sporadic errors (it always happens after a certain request came in),
getting slower and slower (most likely leaking resources), sudden chaos (e.g.
someone pulling the plug or someone removing a hard disk), and high utilisation
of resources.

Linear memory leak

In case of a linear memory leak, the application usually runs into an OOM
eventually, getting ever slower before that due to GC pressure. Reasons could
be linear structures being filled but never emptied. What you observe are
growing heap utilisation and growing GC times. In order to find such leakage
make sure to turn on verbose GC logging, do heapdumps to find leaks. One
challenge though: It may be hard to find the leakage if the problem is not one
large object, but many, many small ones that lead to a death by 1000 cuts
bleeding the application to death.
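
The classic instance of "filled but never emptied" is a cache backed by a
plain map. One possible fix, sketched here under the assumption that dropping
the least recently used entry is acceptable, is to bound the structure.
BoundedCache and maxEntries are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache: unlike a plain HashMap it cannot grow without limit.
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access order: the least recently used entry is the eldest
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Note that LinkedHashMap is not thread safe; shared use would need external
synchronisation.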


In development and testing you will do heap comparisons. Keep in mind that
taking a heap dump causes the JVM to stop. You can use common profilers to look
at the heap dump. There are variants that help with automatic leak detection.


A variant is the pig in a python issue where sudden unusually large objects
cause the application to be overloaded.


Resource leaks and conflicts


Another common problem is leaking resources other than memory - not closing
file handles can be one incarnation. Those problems cause slowness over time
and may lead to the heap growing over time - usually that is not the most
visible problem though. If instance tracking does not help here, your last
resort should be doing code audits.
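
A tiny sketch of both ideas, instance tracking and reliable closing. The class
TrackedHandle and its OPEN counter are invented for illustration; the point is
that try-with-resources releases the handle even when the code using it
throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical resource that counts how many instances are currently open.
final class TrackedHandle implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();

    TrackedHandle() {
        OPEN.incrementAndGet();
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }

    /** Fails while the handle is open and reports how many handles are still open afterwards. */
    static int openHandlesAfterFailure() {
        try (TrackedHandle handle = new TrackedHandle()) {
            throw new IllegalStateException("simulated failure while the handle is open");
        } catch (IllegalStateException e) {
            // try-with-resources has already called close() before this block runs
            return OPEN.get();
        }
    }
}
```

A counter like OPEN that only ever grows in production is a cheap hint at
where a code audit should start.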


In case of conflicting resource usage you usually face code that was developed
with overly cautious locking and data integrity constraints. The way to go are
thread dumps to uncover threads in block and wait states.
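
Besides jstack, the same information is available from inside the JVM. A
minimal sketch using the standard ThreadMXBean; the class and method names are
made up, and a real diagnosis would look at stack traces and lock owners, not
just counts.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper summarising a thread dump taken through JMX.
final class ThreadStates {
    /** Counts live threads per state, e.g. to spot many BLOCKED or WAITING threads. */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Number of threads the JVM considers deadlocked; the MXBean returns null for none. */
    static int deadlockedThreads() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        System.out.println("deadlocked: " + deadlockedThreads());
    }
}
```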


Bad coding practices


When it comes to bad coding practices what is usually seen is code in endless
loops (easy to see in thread dumps) and cpu bound computations where no result
caching is done. Also layeritis with too much (de-)serialisation can be a
problem. In addition there is a general “the ORM will save us all” problem that
may lead to massive SQL statements, or to using the wrong data fetch strategy.
When it comes to caching - if caches are too large, access times of course grow
as well. There could be never ending retry loops or forever blocking networking
calls. Also people tend to catch exceptions but not do anything about them
other than adding a little #fixme annotation to the code.


When it comes to locking you might run into dead-/live-lock problems. There
could be chokepoints (resources that all threads need for each processing
chain). In a thread dump you will typically see lots of wait instead of block
time.


In addition there could be internal and external bottlenecks. In particular
keep those in mind when dealing with databases.


The goal should be to find an optimum for your application between too many too
small requests that waste resources getting dispatched, and one huge request
that everyone else is waiting for.

Inductive Bias

JAX: Java HPC by Norman Maurer

For slides see also: Speakerdeck: High performance networking on the JVM


Norman started his talk clarifying what he means by high scale: Anything above
1000 concurrent connections is considered high scale in his talk, anything
below 100 concurrent connections is fine to be handled with threads and blocking
IO. Before tuning anything, make sure to measure if you have any problem at
all: Readability should always go before optimisation.


He gave a few pointers as to where to look for optimisations: Get started by
studying the socket options - TCP_NODELAY as well as the send and receive
buffer sizes are most interesting. When under GC pressure (check the GC logs
to figure out if you are) make sure to minimise allocation and deallocation of
objects. In order to do that consider making objects static and final where
possible. Make sure to use CMS or G1 for garbage collection in order to
maximise throughput. Size areas in the JVM heap according to your access
patterns. The goal should always be to minimise the chance of running into a
stop the world garbage collection.
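
For the socket options, a minimal NIO sketch. The class name, the helper
methods and the 64 KiB buffer size are arbitrary illustrations, not values
recommended in the talk; buffer sizes are only hints that the operating system
may adjust.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// Hypothetical helper: opens an unconnected channel with the options discussed above.
final class SocketTuning {
    static SocketChannel openTuned(int bufferBytes) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try {
            channel.setOption(StandardSocketOptions.TCP_NODELAY, true);      // disable Nagle
            channel.setOption(StandardSocketOptions.SO_SNDBUF, bufferBytes); // send buffer hint
            channel.setOption(StandardSocketOptions.SO_RCVBUF, bufferBytes); // receive buffer hint
            return channel;
        } catch (IOException | RuntimeException e) {
            channel.close(); // do not leak the file descriptor if tuning fails
            throw e;
        }
    }

    /** Opens a tuned channel, reads the option back and closes the channel again. */
    static boolean noDelayAfterTuning() {
        try (SocketChannel channel = openTuned(64 * 1024)) {
            return channel.getOption(StandardSocketOptions.TCP_NODELAY);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```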


When it comes to using buffers you have the choice of using direct or heap
buffers. While the former are expensive to create, the latter come with the
cost of being zero’ed out. Often people start buffer pooling, potentially
initialising the pool in a lazy manner. In order to avoid memory fragmentation
in the Java heap, it can be a good idea to create the buffer at startup time
and re-use it later on.


In particular when parsing structured messages, as they are common in
protocols, it usually makes sense to use gathering writes and scattering reads
to minimise the number of system calls for reading and writing. Also try to
buffer more if you want to minimise system calls. Use slice and duplicate to
create views on your buffers to avoid memory copies. Use a file channel when
copying files without modifications.
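
A short sketch of both techniques, using a temporary file as a stand-in for a
network channel (an assumption made only to keep the example self-contained).
BufferViews and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo of buffer views and gathering/scattering I/O.
final class BufferViews {
    /** A slice is a view: writing through it changes the underlying buffer, nothing is copied. */
    static byte writeThroughSlice() {
        ByteBuffer pooled = ByteBuffer.allocateDirect(16); // allocated once, e.g. at startup
        pooled.position(4);
        pooled.limit(8);
        ByteBuffer view = pooled.slice(); // covers bytes 4..7 of the pooled buffer
        view.put(0, (byte) 42);
        return pooled.get(4);
    }

    private static boolean anyRemaining(ByteBuffer[] buffers) {
        for (ByteBuffer buffer : buffers) {
            if (buffer.hasRemaining()) {
                return true;
            }
        }
        return false;
    }

    /** Writes header and body with gathering writes, reads them back with scattering reads. */
    static String roundTrip(String header, String body) {
        byte[] h = header.getBytes(StandardCharsets.UTF_8);
        byte[] b = body.getBytes(StandardCharsets.UTF_8);
        try {
            Path file = Files.createTempFile("gather", ".bin");
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                ByteBuffer[] out = {ByteBuffer.wrap(h), ByteBuffer.wrap(b)};
                while (anyRemaining(out)) {
                    channel.write(out); // gathering write: both buffers per call
                }
                channel.position(0);
                ByteBuffer[] in = {ByteBuffer.allocate(h.length), ByteBuffer.allocate(b.length)};
                while (anyRemaining(in) && channel.read(in) >= 0) {
                    // scattering read: fills the header buffer first, then the body buffer
                }
                return new String(in[0].array(), StandardCharsets.UTF_8) + "|"
                        + new String(in[1].array(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

duplicate() works like slice() but covers the whole buffer with independent
position and limit.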


Make sure you do not block - think of DNS servers being unavailable or slow as
an example.


As a parting note, make sure to define and document your threading model. It
may ease development to know that some objects will always only be used in a
single threaded context. It usually helps to reduce context switches as well as
keeping data in the same thread to avoid having to use synchronisation and the
use of volatile.


Also make a conscious decision about which protocol you would like to use for
transport - in addition to tcp there’s also udp, udt, sctp. Use pipelining in
order to parallelise.

The Turkey Curse

Eating dogs and cats

Last week I was on the road in the Rhineland again for the first time in a long while (e.g. to prepare for SIGINT, for various meetings around the Transparenzgesetz NRW, the general meeting of Drosselkom and more), and it left me with a number of strange impressions that are not easy to write about. I still have to get a few things off my chest, even if they may be misunderstood or even come across as an attack, which is not how they are meant.

At the invitation of the state government, the NRW state parliament (Landtag) in Düsseldorf hosted the event Zukunftsforum “Digitale Bürgerbeteiligung” - Open Government und Open Parliament in NRW. To make it clear right away: I thought the event was quite good at its core, and I very much appreciate what the Landtag administration of NRW put together. However, I hope that for further events of this or a similar kind the organisers develop a bit more tact, humour and courage in the details. In my opinion the fitting format would have been more of a classic Barcamp, which would have allowed for a bit more spontaneity. Still, I have to say that I found it much more open than expected.

But as already hinted, there are a few remarks I simply cannot hold back.

The programme started with the opening speeches by the President of the Landtag, Carina Gödecke, and Minister-President Hannelore Kraft. Then it went straight on to the first panel, and what a panel: 10 (in words: ten) men in identical attire faced one (in words: one) woman. On top of that, this woman was greeted with the words: “Now we come to the only woman in our round. But in return she has a beautiful name”. Here is a screenshot of this part of the event, which I think has something rather iconic about it.

Opening panel #opennrw

Note: The screenshot is taken from the video of the panel. Another photo of better quality, which I cannot embed because of its licence, can be found on Flickr.

SRSLY? In 2013 an event on “digital citizen participation” takes place and the organisers seriously put 10 guys and one woman on stage? That is somehow a bit too much postgender for my taste, and anyway: that an event on such a topic has an even lower share of women than the usual nerd conferences or an average party convention of the Piratenpartei should give a lot of food for thought.

Noteworthy on the panel, by the way, was Interior Minister Jäger (third from the left in the picture above) with the line “I think open government has nothing to do with shoving terabytes of data at citizens”. Yes it does, dear Mr Jäger, that is exactly what it is about. That this SPD minister, known lately for his special commitment to the Bestandsdatenauskunft (access to subscriber data) and otherwise not exactly as a friend of civil-rights-friendly politics, says such things at an event of this kind is somehow bitter, since it shows how little the topic obviously interests him; otherwise he would have noticed what it is about. It was also very odd that the panel discussed employees' e-mails, which really interest nobody, because that is not and never was what the debate about open data is about, but which distract totally from the essentials.

In the run-up to the event the hashtag #opennrw was publicly announced for Twitter on the relevant pages and even brochures were printed with it, whereupon someone wrote a small script that fetched cat pictures from Google and constantly posted them to Twitter with that hashtag.

At the event it was announced during the opening that the hashtag would now be changed to #opennrw13, and right away I overheard a phone call behind me in which the words fell “Hey, they changed the hashtag. Get the cats out!” - presumably alluding to “Bring out the KRAKEN”.

Cat attack on #opennrw and #opennrw13

Symbolic image: ongoing cyber attack by Katz3n on #opennrw and #opennrw13

Unsurprisingly, the event of course also had “Twitter walls”, on which at some point “katzen” and “katz3n” were filtered:

#opennrw without cats

OpenNRW, but only without cats: There is obviously a certain hostility towards cats in the NRW Landtag

For me this sums up fundamental deficits in dealing with the digital public very well, and as I see it that was not least the point of the exercise. At least the Landeszentrale für politische Bildung NRW seems to have taken it with humour and tweeted “Kann einer mal die katze füttern!” (“Can someone feed the cat!”). It really should simply be seen as what it is: a perhaps somewhat odd, but thoroughly warm and friendly welcome, a “we will still have a lot of fun if you take the stick out of your backside a little” and an invitation to keep talking to each other at eye level (yes, also at the eye level of the cats, but nobody can get that joke who was not there in a specific situation ^^).

But honestly: I suspect the wrong conclusions will be drawn from this, which already shows in the fact that the originators of this action were present (no, it was not me!), but neither IRL nor on Twitter were they really addressed directly, to which they would most certainly have responded. Because this was (and is) not anonymous porn, malware or link spam, but friendly-looking cats (and, admittedly, the odd pony that seems to have sneaked in).

To conclude there was (drum roll) another panel, this time with only five men and one woman. A pity really, since the workshops over the course of the day were mostly of a good standard, and this panel was too weak for the closing.

Even if it may come across as a bit odd that I direct a criticism at the only woman in the group of all people, I still have to get it off my chest.

Once again the Gov2.0-Netzwerk irritated me: its representative on the panel is not only on the board of the association, but also an employee of Dataport, the service provider for the public administrations of the northern German states. She has stressed to me several times in the past that she only works there as a press spokesperson, and she describes herself as a journalist (I define the essence of that term differently, but that only as an aside). Yet information about this involvement can be found neither on the event's website, nor was it pointed out during the panel. This resembles the blog post Die GovData-Entrüstung…ein Bärendienst? back then, in which the author of the criticism of the independent open data/open government scene likewise “forgot” to make clear what his personal context is, as should be customary in such cases: he is NTO (National Technology Officer) at Microsoft and was formerly with CSC, which (fun fact on the side and totally unrelated) is right now supposed to carry out a functional test of the Staatstrojaner. Using an NGO as one's only reference while at the same time working for companies active in the very field one represents is more than just a little remarkable, and I had already raised this during my re:publica talk with Lorenz.

Especially within this scene, which has committed itself to opening up political and procedural processes, transparency is of very particular importance, even if those involved see that quite differently and, for example, complained on Facebook in the aftermath of the blog post mentioned that they found these clarifications annoying (sorry, I no longer have a Facebook account to link to it). In my view, however, the decision and judgement about that is not theirs to make at all (which I have stressed several times before). So if they want to talk about the much-invoked “Kulturwandel” (cultural change), this is part of that new culture too: clear statements about possible conflicts of interest, as they quite obviously exist. These arise automatically from employment contracts and considerably restrict both the ability and the opportunity to criticise. That also shows in the rather non-committal statements on this panel, where there would certainly have been a good deal more to say. Let it be made clear once more, though, that this is not meant as criticism of their expertise, and I do consider the people competent.

At Chaosdorf the evening ended with short talks, called “Freitagsfoo”, on ZFS (fitting in the context of the “terabytes of data”), occupational safety, DNS and ideas for home-grown encryption, followed by the consumption of Barbarella and things like Smells Like Humppa by Eläkeläiset, which on this day in particular expressed especially well how I felt about some of what had happened that day.

When at some point late at night I searched Google for “opennrw”, I got the following page as an answer, whose ad block at the bottom of the page inspired the title of this text and practically forced me to write the whole thing down briefly. In my opinion it is important to realise what public space on the net currently means and how far that is from what it should be. Or as I said in my Kirchentag speech: “The public space we are talking about here is more comparable to a department store in which we meet and exchange views. In reality nobody would seriously regard that as public space in the sense in which we otherwise quite naturally perceive it, but as what it is: a private space with public access”. So it is no surprise either what the term “OpenNRW” is associated with from the advertisers' point of view.

Eating dogs and cats!

Quite apart from the fact that I learned that in Switzerland it is supposedly perfectly normal to eat dogs and cats: experiences on days like these are why I simply love this internet with all my heart!

Inductive Bias

JAX: Hadoop overview by Bernd Fondermann


After breakfast was over the first day started with a talk by Bernd on the
Hadoop ecosystem. He did a good job selecting the most important and
interesting projects related to storing data in HDFS and processing it with Map
Reduce. After the usual “what is Hadoop”, “what does the general architecture
look like”, “what will change with YARN” Bernd gave a nice overview of which
publications each of the relevant projects rely on:


  • HDFS is mainly based on the paper on GFS.

  • Map Reduce comes with its own publication.

  • The big table paper mainly inspired Cassandra (to some extent), HBase,
    Accumulo and Hypertable.

  • Protocol Buffers inspired Avro and Thrift, and is available as free
    software itself.

  • Dremel (the storage side of things) inspired Parquet.

  • The query language side of Dremel inspired Drill and Impala.

  • Power Drill might inspire Drill.

  • Pregel (a graph processing system) inspired Giraph.

  • Percolator provided some inspiration to HBase.

  • Dynamo by Amazon kicked off Cassandra and others.

  • Chubby inspired Zookeeper, both are based on Paxos.

  • On top of Map Reduce today there are tons of higher level languages:
    starting with Sawzall inside of Google and continuing with Pig and Hive at
    Apache, we are now left with added languages like Cascading, Cascalog,
    Scalding and many more.

  • There are many other interesting publications (Megastore, Spanner, F1 to
    name just a few) for which there is no free implementation yet. In addition,
    Storm, Hana and Haystack are implementations lacking canonical
    publications.



After this really broad clarification of names and terms used, Bernd went into
some more detail on how Zookeeper is being used for defining the namenode in
Hadoop 2, and how high availability and federation work for namenodes. In
addition he gave a clear explanation of how block reports work on cluster
bootup. The remainder of the talk was reserved for giving an intro to HBase,
Giraph and Drill.

Inductive Bias

BigDataCon


Together with Uwe Schindler I had published a series of articles on Apache
Lucene at Software and Support Media’s Java Mag several years ago. Earlier this
year S&S kindly invited me to their BigDataCon - co-located with JAX - to give a
talk of my choosing that at least touches upon Lucene.


Thinking back and forth about what topic to cover what came to my mind was to
give a talk on how easy it is to do text classification with Mahout when
relying on Apache Lucene for text analysis, tokenisation and token filtering.
All classes essentially are in place to integrate Lucene Analyzers with Mahout
vector generation - needed e.g. as a pre-processing step for classification or
text clustering.


Feel free to check out some of my sandbox code over at github:
http://github.org/MaineC/sofia


After attending the conference I can only recommend everyone interested in Java
programming and able to understand German to buy a ticket for the conference.
It’s really well executed, great selection of talks (though the sponsored
keynotes usually aren’t particularly interesting), tasty meals, interesting
people to chat with.

Inductive Bias

Hadoop Summit Amsterdam


About a month ago I attended the first European Hadoop Summit, organised by
Hortonworks in Amsterdam. The two day conference brought together both vendors
and users of Apache Hadoop for talks, exhibition and after conference beer
drinking.


Russell Jurney kindly asked me to chair the Hadoop applied track during
Apache Con EU. As a result I had a good excuse to attend the event. Overall
there were at least three times as many submissions as could reasonably be
accepted. Accordingly, selecting proposals was pretty hard.


Though some of the Apache community aspect was missing at Hadoop Summit it was
interesting nevertheless to see who is active in this space, both as users and
as vendors.


If you check out the talks on Youtube make sure to not miss the two sessions by
Ted Dunning as well as the talk on handling logging data by Twitter.

Inductive Bias

ApacheConNA: Misc


In his talk on SPDY Matthew Steele explained how he implemented the SPDY protocol
as an Apache httpd module - working around most of the safety measures and
design decisions in the current httpd version. Essentially, to get httpd to
support the protocol all you need now is mod_spdy plus a modified version of
mod_ssl.


The keynote on the last day was given by the Puppet founder. Some interesting
points to take away from that:


  • Though hard in the beginning (and half way through, and after years) it
    is important to learn to give up control: It usually is much more productive and
    leads to better results to encourage people to do something than to be
    restrictive about it. A single developer only has so much bandwidth - by
    farming tasks out to others - and giving them full control - you substantially
    increase your throughput without having to put in more energy.

  • Be transparent - it’s ok to have commercial goals with your project. Just
    make sure that the community knows about it and is not surprised to learn about
    it.

  • Be nice - not many succeed at this, not many are truly able to ignore
    religion (vi vs. emacs). This also means to be welcoming to newbies, to hustle
    at conferences, to engage the community as opposed to announcing changes.


Overall good advice for those starting or working on an OSS project and seeking
to increase visibility and reach.

If you want to learn more about the other talks given at ApacheCon NA, or want to follow up in more detail on some of the talks described here, check out the slides archive online.

Inductive Bias

ApacheConNA: Hadoop metrics


Have you ever measured the general behaviour of your Hadoop jobs? Have you
sized your cluster accordingly? Do you know whether your work load really is IO
bound or CPU bound? Legend has it no one except Allen Wittenauer over at
LinkedIn, formerly Y!, ever did this analysis for his clusters.


Steve Watt gave a pitch for actually going out into your datacenter, measuring
what is going on there and adjusting the deployment accordingly: In small
clusters it may make sense to rely on RAIDed disks instead of additional
storage nodes to guarantee “replication levels”. When going out to vendors to
buy hardware don’t rely on paper calculations only: Standard servers in Hadoop
clusters are 1U or 2U. This is quite unlike the beefy boxes being sold otherwise.


Figure out what reference architecture is being used by partners, run your
standard workloads, adjust the configuration. If you want to, run the 10TB
TeraSort to benchmark your hardware and system configuration. Make sure to
capture data during all your runs - have Ganglia or SAR, watch out for
interesting behaviour in io rates, cpu utilisation, network traffic. The goal is
to get the cpu busy, not wait for network or disk.
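To make the "CPU busy vs. waiting" check a bit more tangible, here is a minimal sketch of how captured utilisation samples could be summarised. It assumes the samples carry the `%user`, `%system` and `%iowait` percentages that `sar -u` reports; the `summarise` helper, the sample values and the simple "more waiting than working" rule are my own illustrative assumptions, not something from the talk.

```python
from statistics import mean


def summarise(samples):
    """Summarise CPU samples given as dicts with 'user', 'system', 'iowait' percentages."""
    busy = mean(s["user"] + s["system"] for s in samples)
    waiting = mean(s["iowait"] for s in samples)
    # Assumed heuristic: the run looks IO bound if the CPU waits more than it works.
    return {"busy": busy, "iowait": waiting, "io_bound": waiting > busy}


# Hypothetical samples captured during a benchmark run.
run = [
    {"user": 20, "system": 5, "iowait": 60},
    {"user": 30, "system": 5, "iowait": 50},
]
summary = summarise(run)
print(summary)
```

A real analysis would of course also look at network and per-disk rates; this only illustrates the kind of question the captured data should answer.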


After the instrumentation and trial run look for over- and underprovisionings,
adjust, lather, rinse, repeat.


Also make sure to talk to the datacenter people: There are floor space, power
and cooling constraints to keep in mind. Don’t let the whole datacenter go down
because your cpu intensive job is drawing more power than the DC was designed
for. There are also power constraints per floor tile due to cooling issues -
those should dictate the design.


Take a close look at the disks you deploy: SATA vs. SAS can make a 40%
performance difference at a 20% cost difference. Also the number of cores per
machine dictates the number of disks to spread the likelihood of random read
access. As a rule of thumb - in a 2U machine today there should be at least
twelve large form factor disks.
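As a back-of-the-envelope check of the disk trade-off: the sketch below assumes the 40% and 20% figures both describe SAS relative to SATA (the direction is my assumption, it is not spelled out above) and compares performance per unit of cost.

```python
def performance_per_cost(relative_performance: float, relative_cost: float) -> float:
    """Performance obtained per unit of money, both relative to a baseline."""
    return relative_performance / relative_cost


# SATA is the baseline; SAS is assumed to be 40% faster at 20% higher cost.
sata = performance_per_cost(1.0, 1.0)
sas = performance_per_cost(1.4, 1.2)

advantage = round(sas / sata, 3)
print(advantage)
```

Under these assumptions the faster disks deliver roughly 17% more performance per unit of cost, which is why it pays to look beyond the purchase price alone.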


When it comes to controllers the goal should be to get a dedicated lane to disk;
save one controller if price is an issue. Trade off compute power against power
consumption.


When designing your network keep in mind that one switch going down means that one
rack will be gone. This may be a non-issue in a Y! size cluster, in your
smaller scale world it might be worth the money investing in a second switch
though: Having 20 nodes go black isn’t a lot of fun if you cannot farm out the
work and re-replication to other nodes and racks. Also make sure to have enough
ports in rack switches for the machines you are planning to provision.


Avoid playing the ops whack-a-mole game by having one large cluster in the
organisation rather than many different ones where possible. Multi-tenancy in
Hadoop is still premature though.


If you want to play with future deployments - watch out for HP, which is currently
packing 270 servers into the space that holds just two today, via system on a chip designs.

www.c3d2.de Newsfeed

TA: Lightning Talks #3

Date
Tuesday, 21 May 2013 at 20:23
Location
HQ, Chaos Computer Club Dresden, Lignerallee 3

Once again we are trying our hand at lightning-fast knowledge exchange in bite-sized chunks of 5 to 15 minutes. We will meet this coming Chaos Tuesday for at least the following topics:

By the way, we now collect talk ideas in our wiki. Get involved and give a lightning talk!

Video recording: Pentaradio publishing process

Video recording: fish

Video recording: IPython

The Turkey Curse

Ad blocking is a security issue

Dear Stefan,

I am writing to you because I want to make one thing clear: I do not block ads because ads annoy me (I no longer see them) or because I want to deprive you of your revenue, but because the ads are delivered via JavaScript by third parties that I consider anything but trustworthy. I do appreciate good journalism, I know that it does not come for free, and I think I probably have more understanding for the publishers’ problems than many others.

Nevertheless I cannot allow random outfits that regularly get pwned to remote-control my machine - because that, and only that, is what JavaScript is: remote control of a browser by the server. This blocking is therefore first and foremost pure self-protection, and I strongly recommend it to every user.

The history of malicious ad banners on media sites is long and has hit pretty much everyone at some point, be it Zeit, Spon, Heise or Handelsblatt, to name just a few through which malicious code was distributed. Not least, one problem is that in such cases there was and is no compensation for the victims of these attacks apart from an apology (if people even notice that they were hacked).

I am not the person who has great ideas for business models that work. I would wish for something that does not seem to be an option for your industry: something like a flat fee for all publications - roughly following the model of the GEZ for the public broadcasters. I use the offerings rather sporadically (mostly because of links), but practically never “browse” around the sites, which is why a subscription makes no sense for me. This kind of payment would suit my usage behaviour.

Unfortunately micropayment has not really made progress in recent years, and apart from Flattr I currently see little. But I also understand that high-quality journalism cannot be financed that way for long.

Be that as it may: only I can take responsibility for my security - no state, no publisher, no journalist and no kind words. The consequence is that the ads, insofar as they cannot be dispensed with, are either embedded in a way that works without JavaScript, or they simply get blocked. Because information is extremely important to me, but not more important than my security.

Kind regards, fukami

Inductive Bias

ApacheConNA: Monitoring httpd and Tomcat


Monitoring - a task generally neglected - or overdone - during development.
But still vital enough to wake people up from well earned sleep at night when
done wrong. Rainer Jung provided some valuable insights on how to monitor Apache httpd and Tomcat.


Of course failure detection, alarms and notifications are all part of good
monitoring. However, so are the avoidance of false positives and the
collection and visualisation of metrics in advance, to help with capacity
planning and uncover irregular behaviour.


In general the standard pieces being monitored are load, cache utilisation,
memory, garbage collection and response times. What we do not see from all that
are times spent waiting for the backend, looping in code, blocked threads.


When it comes to monitoring Java - JMX is pretty much the standard choice. Data
is grouped in management beans (MBeans). Each Java process has default beans,
on top there are beans provided by Tomcat, on top there may be application
specific ones.


For remote access, there are Java clients that know the protocol - the server
must be configured though to accept their connection. Keep in mind to open the
firewall in between as well if there is any. Well known clients include
JVisualVM (nice for interactive inspection) and jmxterm as a command line client.


The only issue: Most MBeans encode source code structure, where what you really
need is change rates. In general those are easy to infer though.
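Inferring a change rate from such counters boils down to sampling twice and dividing the difference by the elapsed time. The sketch below is a minimal illustration under that assumption; the `Sample` type, the `change_rate` helper and all numbers are hypothetical, and values are assumed to be already converted to the unit you want (e.g. CPU time in seconds). Applied to a process CPU time counter, the same division yields the average CPU concurrency.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp_s: float  # wall clock time of the reading, in seconds
    value: float        # cumulative counter value read from an MBean attribute


def change_rate(earlier: Sample, later: Sample) -> float:
    """Average change of the counter per second between two readings."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return (later.value - earlier.value) / elapsed


# Hypothetical request counter read one minute apart: requests per second.
requests_per_s = change_rate(Sample(0, 1000), Sample(60, 1600))

# Hypothetical process CPU time in seconds: CPU seconds used per wall clock
# second, i.e. the average CPU concurrency.
cpu_concurrency = change_rate(Sample(0, 12.0), Sample(60, 42.0))

print(requests_per_s, cpu_concurrency)
```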


On the server side for Tomcat there is the JMXProxy in Tomcat manager that
exposes MBeans. In addition there is Jolokia (including JSON serialisation) or
the option to roll your own.


So what kind of information is in MBeans:


  • OS - load, process cpu time, physical memory, global OS level
    stats. As an example: Here dividing cpu time by elapsed time gives you the
    average cpu concurrency.


  • Runtime MBean gives uptime.

  • Threading MBean gives information on count, max available threads etc

  • Class Loading MBean should get stable unless you are using dynamic
    languages or have enabled class unloading for jsps in Tomcat.

  • Compilation contains HotSpot compiler information.

  • Memory contains information on all regions thrown in one pot. If you need
    more fine grained information look out for the Memory Pool and GC MBeans.


As for Tomcat specific things:


  • Threadpool (for each connector) has information on size, number of busy
    threads.

  • GlobalRequestProc has request counts, processing times, max time, bytes
    received/sent, error count (those that Tomcat notices that is).

  • RequestProcessor exists once per thread, it shows if a request is
    currently running and for how long. Nice to see if there are long running
    requests.

  • DataSource provides information on Tomcat provided database connections.


Per Webapp there are a couple more MBeans:


  • ManagerMBean has information on session management - e.g. session
    counter since start, login rate, active sessions, expired sessions, max active
    sessions since restart (here a restart is possible), number of rejected
    sessions, average alive time, processing time it took to clean up sessions,
    create and expire rate for last 100 sessions

  • ServletMBean contains request count, accumulated processing time.

  • JspMBean (together with activated loading/unloading policy) has
    information on unload and reload stats and provides the max number of loaded
    jsps.


For httpd the goals with monitoring are pretty similar. The only difference is
the protocol used - in this case provided by the status module. As an
alternative use the scoreboard connections.


You will find information on


  • restart time, uptime

  • serverload

  • total number of accesses and traffic

  • idle workers and number of requests currently processed

  • cpu usage - though that is only accurate when all children are stopped
    which in production isn’t particularly likely.


Lines that indicate what threads do contain waiting, request read, send
reply - more information is documented online.


When monitoring make sure to monitor not only production but also your stress
tests to make meaningful comparisons.
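To give an idea of what collecting the httpd numbers can look like, here is a small sketch that parses the machine readable output of the status module (the `key: value` lines you get when appending `?auto` to the status URL) and counts worker states from the scoreboard. The sample text and its numbers are made up for illustration, and the helper names are my own; fetching the page over HTTP is left out on purpose.

```python
from collections import Counter

# Scoreboard key as documented for the httpd status module.
SCOREBOARD_KEYS = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def parse_status(text: str) -> dict:
    """Turn 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def worker_states(scoreboard: str) -> Counter:
    """Count how many workers are in each scoreboard state."""
    return Counter(SCOREBOARD_KEYS.get(ch, "unknown") for ch in scoreboard)


# Made-up sample output, for illustration only.
sample = """Total Accesses: 1200
Total kBytes: 3400
Uptime: 600
BusyWorkers: 3
IdleWorkers: 5
Scoreboard: __W_R_W_.."""

fields = parse_status(sample)
states = worker_states(fields["Scoreboard"])
requests_per_second = int(fields["Total Accesses"]) / int(fields["Uptime"])
print(states, requests_per_second)
```

Feeding numbers like these into the same graphs for production and for stress tests is what makes the comparison meaningful.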

Inductive Bias

ApacheConNA: On Security


During the security talk at Apache Con a topic commonly glossed over by
developers was covered in quite some detail: With software that is deployed
rather widely online (over 50% of all websites are powered by the Apache
webserver) security issues naturally are of large concern.


Currently there are eight trustworthy people on the foundation-wide security
response team, subscribed to security@apache.org. The team was started by
William A. Rowe when he found a vulnerability in httpd. The general work mode -
as opposed to the otherwise “all things open” way of doing things at Apache -
is to keep the issues found private until fixed and publicise widely
afterwards.


So when running Apache software on your servers - how do you learn about
security issues? There is no such thing as a priority list for specific
vendors. The only way to get an inside scoop is to join the respective
project’s PMC list - that is to get active yourself.


So what is being found? 90% of all security issues are found by security
researchers. The remaining 10% are usually published accidentally - e.g. by
users submitting the issue through the regular public bug tracker of the
respective project.


In Tomcat so far no issue has been disclosed w/o letting the project know. httpd
still is the prime target - even of security researchers who are in favour of
a full disclosure policy - the PMC cannot do a lot here other than fix issues
quickly (usually within 24 hours).


As a general rule of thumb: Keep your average release cycle time in mind - how
long will it take to get fixes into people’s hands? Communicate transparently
which version will get security fixes - and which won’t.


As for static analysis tools - many of those are written for web apps and as
such not very helpful for a container. What is highly dangerous in a web app
may just be the thing the container has to do to provide features to web apps.
As for Tomcat, they have had good experiences with FindBugs - most others have
too many false positives.


When dealing with a vulnerability yourself, try to get guidance from the
security team on what is actually a security vulnerability - though the final
decision is with the project.


Dealing with the tradeoff of working in private vs. exposing users affected by
the vulnerability to attacks is up to the PMC. Some work in public but call the
actual fix a refactoring or cleanup. Given enough coding skills on the attacker
side this of course will not help too much as sort of reverse engineering what
is being fixed by the patches is still possible. On the other hand doing
everything in private on a separate branch isn’t public development anymore.


After this general introduction Mark gave a good overview of the good, the bad
and the ugly way of handling security issues in Tomcat. His slides include an
anecdote that, judging by timing and topic, looks highly related to the 2011
Java Hash Collision talk at Chaos Communication Congress.

Inductive Bias

ApacheConNA: On documentation


In her talk on documentation on OSS Noirin gave a great wrap up of the topic of
what documentation to create for a project and how to go about that task.


One way to think about documentation is to keep in mind that it fulfills
different tasks: There is conceptual, procedural and task-reference
documentation. When starting to analyse your docs you may first want to debug
the way it fails to help its users: “I can’t read my mail” really could mean
“My computer is under water”.


A good way to find awesome documentation can be to check out Stackoverflow
questions on your project, blog posts and training articles. Users today really
are searching instead of browsing docs. So where to find documentation actually
is starting to matter less. What does matter though is that those pages with
relevant information are written in a way that makes it easy to find them
through search engines: Provide a decent title, stable URLs, reasonable tags
and descriptions. By the way, both infra and docs people are happy to talk to
*good* SEO guys.


In terms of where to keep documentation:


For conceptual docs that need regular review it’s probably best to keep them in
version control. For task documentation steps should be easy to upgrade once
they fail for users. Make sure to accept bug reports in any form - be it on
Facebook, Twitter or in your issue tracker.


When writing good documentation always keep your audience in mind: If you don’t
have a specific one, pitch one. Don’t try to cater for everyone - if your docs
are too simplistic or too complex for others, link out to further material.
Understand their level of understanding. Understand what they will do after
reading the docs.


On a general level always include an about section, a system overview, a
description of when to read the doc, how to achieve the goal, provide
examples, provide a troubleshooting section and provide further information
links. Write breadth first - details are hard to fill in without a bigger
picture. Complete the overview section last. Call out context and
prerequisites explicitly, don’t make your audience do more than they really
need to do. Reserve the details for a later document.


In general the most important and most general stuff as well as the basics
should come first. Mention the number of steps to be taken early. When it comes
to providing details: The more you provide, the more important the reader will
deem that part.

Inductive Bias

ApacheConNA: On delegation


In her talk on delegation Deb Nicholson touched upon a really important topic in
OSS: Your project may live longer than you are willing to support it yourself.


The first important point about delegation is to delegate - and to not wait
until you have to do it. Soon you will realise that mentoring and delegation
actually is a way to multiply your resources.


In order to delegate, people to delegate to are needed. To find those it can be
helpful to understand what motivates people to work in general as well as on
open source in particular: Sure, fixing a given problem and working on great
software projects may be part of it. As important though is recognition
individually and in groups of people.


Keeping that in mind, “Thanking” is actually a license to print free money in
the open source world. Do it in a verbose manner to be believable, do it in
public and in a way that makes your contributors feel a little bit of glory.


Another way to lead people in is to help out socially: Facilitate connections,
suggest connections, introduce people. Based on the diversity of the project
you are working on you may be in a way larger network and have access to many
more corporations and communities than any peer who is not active. Use that
potential.


Also when leading OSS projects keep an eye on people being rude: Your project
should be accessible to facilitate participation.


In case of questions treat them as a welcome opportunity to pull a new
community member in: Answer quickly, answer on your list, delegate to middle
seniors to pull them in. Have training missions for people who want to get
started and don’t know your tooling yet. Have prepared documents to provide
links to in case questions occur.


In Apache we tend to argue people should not fall victim to volunteeritis.
Another way to put that is to make sure to avoid the licked cookie syndrome:
When people volunteer to do a task and never re-appear that task is tainted
until explicitly marked as “not taken” later on. One way to automate that is
to have a fixed deadline after which tasks are automatically marked as free to
take and tackle by anyone.
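A minimal sketch of what such an automated deadline could look like, assuming tasks are tracked with a claimant and a claim date. The `Task` type, the 30 day limit and the example names are purely hypothetical, not a description of any existing Apache tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

CLAIM_TTL = timedelta(days=30)  # hypothetical deadline; pick what suits the project


@dataclass
class Task:
    title: str
    claimed_by: Optional[str] = None
    claimed_on: Optional[date] = None


def release_stale_claims(tasks, today, ttl=CLAIM_TTL):
    """Mark tasks as free again once their claim is older than the deadline."""
    released = []
    for task in tasks:
        if task.claimed_by is None or task.claimed_on is None:
            continue  # nobody licked this cookie
        if today - task.claimed_on > ttl:
            task.claimed_by = None
            task.claimed_on = None
            released.append(task)
    return released


tasks = [
    Task("write getting started docs", "alice", date(2013, 1, 1)),
    Task("fix build", "bob", date(2013, 3, 1)),
    Task("design logo"),
]
released = release_stale_claims(tasks, today=date(2013, 3, 15))
print([t.title for t in released])
```

Running something like this on a schedule keeps the "not taken" marking from depending on anyone remembering to clean up.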


When it comes to the question of when to write documentation: There really is
no point in time that should stop you from contributing docs - all the way from
just above getting started level (writing the getting started docs for those
following you) up to the “I’m an awesome super-hacker” mode for those trying
to hack on similar areas.


Especially when delegating to newbies make sure to set the right expectations:
How long is it going to take to fix an issue, what is the task complexity, tell
them who is going to be involved, who is there to help out in case of road
blocks.


In general make sure to be a role model for the behaviour you want in your
project: Ask questions yourself, step back when you have taken on too much,
appreciate people stepping back.


Understand the motivation of your newcomers - try to talk to them one on one
to understand their motivation and help to align work on the project with their
life goal. When starting to delegate, start with tasks that seem too small to
delegate at all to get new people familiar with the process - and to get
yourself familiar with the feeling of giving up control. Usually you will need
to pull tasks apart that before were done by one person. Don’t look for a
person replacement - instead look for separate tasks and how people can best
perform these.


Make visible and clear what you need: Is it code or reviews? Documentation or
translations, UX helpers? Incentivise what you really need - have code sprints,
gamify the process of creating better docs, put the logo creation under a
challenge.


All of this is great if you have only people who all contribute in a very
positive way. What if there is someone whose contributions are actually
detrimental to the project? How to deal with bad people? They may not even do
so intentionally… One option is to find a task that better suits their
skills. Another might be to find another project for them that better fits
their way of communicating. Talk to the person in question, address head on
what is going on. Talking around or avoiding that conversation usually only
delays and enlarges your problem. One simple but effective strategy can be to
tell people what you would like them to do in order to help them find out that
this is not what they want to do - that they are not the right people for you
and should find a better place.


More on this can be found in material like “How assholes are killing your
project” as well as the “Poisonous people talk” and the book “Producing
open source software”.


On the how of dealing with bad people make sure to criticise privately first,
check in a backchannel of other committers for their opinion - otherwise you
might be lonely very quickly. Keep to criticising the behaviour instead of the
person itself. Most people really do not want to be a jerk.

Inductive Bias

ApacheConNA: First keynote


All three ApacheCon keynotes were focussed around the general theme of open
source communities. The first one, given by Theo, had very good advice for the
engineer not only striving to work on open source software but to become an
excellent software developer:


  • Be loyal to the problem instead of to the code: You shouldn’t be
    addicted to any particular programming language or framework and refuse to work
    and get familiar with others. Avoid high specialisation and seek cross
    fertilisation. Instead of addiction to your tooling you should seek to
    diversify your toolset to use the best for your current problem.

  • Work towards becoming a generalist: Understand your stack top to bottom -
    starting with your code, potentially passing the VM it runs in, down to the
    hardware layer. Do the same to requirements you are exposed to: Data being 1s old
    may be just good enough to be “always current” when thinking of a news
    serving web site. Try to understand the real problem that underpins a certain
    technical requirement that is being brought up to you. This deep understanding
    of how your system works can make the difference in fixing a production issue
    in three days instead of twelve weeks.


The last point is particularly interesting for those aiming to write scalable
code: Software and frameworks today are intended to make development easier -
with high probability they will break when running at the edge.


What is unique about the ASF is the great opportunity to meet people with
experience in many different technologies. In addition there is an unparalleled
level of trust in a community as diverse as the ASF. One open question that
remains is how to leverage this potential successfully within the foundation.

Inductive Bias

Apache Hadoop Get Together Berlin

This evening I joined the group over at Immobilienscout 24 for today’s Hadoop Get Together. David Obermann had invited Dr. Falk-Florian Henrich from CeleraOne to talk about their real-time analytics on live data streams.

Their system is being used by the New York Times and Springer’s Die Welt for traffic analysis. The goal is to identify recurring users that might be willing to pay for the content they want to read. The trade-off here is to keep readers interested long enough to make them pay in the end, instead of scaring them away with a restrictive pay wall which would immediately lead to far less ad revenue.

Currently CeleraOne’s system is based on a combination of MongoDB for persistent storage, ZeroMQ for communicating with the revenue engine and http/json for connecting to the controlling web frontend. The live traffic analysis is all done in RAM, while long term storage ends up in MongoDB.

The second speaker was Michael Hausenblas from MapR. He spends most of his time contributing to Apache Drill - an open source implementation of Google’s Dremel.

Being an Apache project Drill is developed in an open, meritocratic way - contributors come from various different backgrounds. Currently Drill is in its early stages of development: They have a logical plan, a reference interpreter, a basic SQL parser. There is a demo application. As data backends they support HBase.

For most of the implementation they are trying to re-use existing libraries, e.g. for the columnar storage Drill is looking into either using Twitter’s Parquet or Hive ORC file format.

In terms of contributing to the project: There is no need to be a rockstar programmer to make valuable contributions to Apache Drill: Use cases, documentation, test data are all valuable and appreciated by the project.

For more information check out the slide deck (this is an older version - tonight’s edition will most likely be published soon):

If you missed today’s event make sure to get enlisted in the Hadoop Get Together Xing Group so next time you get a notification.

One thing to note though: When registering for the event - please make sure to free your ticket if you cannot make it. I had a few requests from people who would have loved to attend today who didn’t get a ticket but would most likely have fit into the room.

genius' blog

Fascinating

“Copyright law, originally designed to compensate starving artists, allows starving attorneys in this electronic-media era to plunder the citizenry.”

Otis D. Wright II

Inductive Bias

ApacheConNA: Meet the indian tribe

ApacheCon is the “User Conference of the Apache Software Foundation”. What
should that mean? If you are going to Apache Con you have the chance of meeting
committers of your favourite projects as well as members of the foundation
itself. Though there are a lot of talks that are interesting from a technical
point of view the goal really is to turn you into an active member of the
foundation yourself. This is true for the North American version even more than
for the European edition.


Though why should you as a general user of Apache software be interested in
attending then? Pieter Hintjens put it quite nicely in an interview on his
latest ZeroMQ book with O’Reilly:




If you are using free software in particular in commercial setups you really do
want to know how the project is governed and what it takes to get active and
involved yourself. What would it take to move the project into a direction that
fits your business needs? How do you make sure features you need are actually
being added to the project instead of useless stuff?


ApacheCon is the conference to find out how Apache projects work internally,
the place to be to meet active people in person and put faces to names. Lots of
community building events focus on getting newbies in touch with long term
contributors.

Inductive Bias

How to get your submission accepted at Berlin Buzzwords

Disclaimer: Intentionally posting on my private blog - these are my own criteria, not general advice from the review committee.

Berlin Buzzwords is in its fourth year. Probably the most tedious task of all is having to select talks to make it into the final schedule. With roughly 120 submissions and roughly 30 slots to fill the result is that three quarters of all submissions have to be rejected. Last year I shared some details on how we do talk ranking once reviewers have provided their input.

Now that the mechanics of ranking are clear, people have asked me what goes into the reviews themselves. Here I can only speak for myself: After doing reviews ourselves during the first two years, Simon, Jan and I decided to spread the work of reviewing submissions among a larger team of people. As nearly all of them had attended Berlin Buzzwords in the past already (or had at least followed the conference remotely) we could assume they were roughly familiar with what kind of content would be a good fit. As a result the review guidelines that we send out tend to be rather light:

Berlin Buzzwords is a conference from geeks for geeks: The goal is to get the people actively working in the field together to meet and exchange ideas. Content should have some technical depth - in particular pure marketing talks and obvious product placements without further technical value are not welcome. We usually invite both, interesting case studies as well as talks highlighting the technical details a project is built upon.

In the end judgement is up to the individual reviewer - so I can speak only for myself when listing what you should do to get your talk accepted.

  • Be on topic. There’s always a handful of submissions that look and sound like pure marketing, product placement or simply aren’t related to software engineering at all. Those tend to be easy to spot and weed out.
  • Tell us what you are talking about. An abstract is there to provide some detail on your presentation - don’t be just funny, promising overly generic content. In order to decide whether or not your talk is relevant please provide some details on which direction you’ll be heading.
  • Don’t be too detailed in the abstract either - there’s no need to list the content of every slide. Make sure the abstract correctly summarizes your talk; making it catchy and nice to read usually helps if the content is solid.
  • We try to find those speakers that have not only an interesting topic to talk about but are also a pleasure to listen to, who can successfully get their point across. We cannot know every potential speaker in person though. As a result it helps if you list which conferences you’ve spoken at in the past; any videos of previous talks are helpful as well. As a general piece of advice: Choosing Berlin Buzzwords as your first conference to speak at ever usually is a recipe for disaster. Get some practice at local meetups like the Berlin Hadoop Get Together, the data science day, the Java User Group Berlin Brandenburg, the RecSys Stammtisch Berlin or the MongoDB User Group Berlin to name just a few.
  • Make sure your talk is novel - submitting the same topic in 2012 and 2013 is a great way to ensure getting rejected. It is fine to submit a talk you have given at another conference earlier. However, if everyone in the Buzzwords audience is very likely to have already watched the exact same version of your presentation, we are less likely to accept your talk.
  • Finally: When drafting your bio, make sure to include details that explain why you are the perfect expert to talk about the topic at hand. As much as I’d like to, I don’t know every project’s committer by name. Provide some help by pointing out explicitly what your contributions have been or in what context you have used the technology you are presenting. Don’t be shy to mention that you are a co-founder of a successful project. Not only does this information help with selecting talks, it also provides some background for the audience to judge the claims you make.

Two words on the role of free software at Buzzwords: There is no explicit requirement to only talk about software that is publicly available under a free software license. However, if a project or framework is presented, being open source makes it more applicable for the audience. Most projects discussed at Berlin Buzzwords are developed openly. In order to get the maximum out of these projects, it pays to know how they work internally, how to get active yourself, and how to contribute. As a result, discussions and talks on project governance are generally welcome.

A parting note: With way more than half of all submissions having to be rejected, making a final decision will always be hard. Being rejected doesn’t necessarily mean that your proposal was bad. Following the above advice may raise your chances of being accepted - however, it is no guarantee. We could raise the number of accepted talks by extending the conference by another track or even another day - at the cost of raising the ticket price substantially. However, we want not only “big corp representatives” but a diverse audience: attendees who get active themselves and help shape the conference:

There’s plenty of space and time to get active in addition to the main conference program. Use the time and space to shape the conference.

E-Mail Address Index