LEAR coding guidelines
Rationale
Like most people hired at LEAR, you probably have long experience of
coding, and already well-established coding habits. You typically have
an idea of what language you will code in and with what libraries. You
know common coding practices well.
However, it turns out that code, especially when produced by
brilliant coders, has annoying shortcomings. Therefore, please
consider the points on this page to avoid common mistakes.
Code at LEAR is always linked to a paper, referred to as "the paper"
in the following.
You should not need to talk about or show code to your advisors. Advisors
think about the paper, which does not contain code, and scientific
contributions. Only people who "have bugs" talk about code.
Objectives
What is expected from LEAR research code, ranked by importance.
Code should work
It should give good results in terms of precision, speed, etc. (whatever
you claim in the paper).
Random segfaults are not acceptable.
Code should be flexible
Refactoring should be easy. For example, you should be ready to
replace parts of your code or extract parts from it to be used
elsewhere.
The paper should be reproducible
Always assume that your advisor will ask you to re-run the
experiments.
You should know, and preferably state in LaTeX comments in the
paper, what to run to produce each number and figure in the
paper.
Seed random generators in a reproducible way.
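A minimal sketch of reproducible seeding in Python, using only the standard library. The function name sample_training_subset and the seed value are hypothetical. The point is that the seed is an explicit argument and the experiment owns its generator instead of relying on global state.

```python
import random

def sample_training_subset(n_items, n_samples, seed):
    """Draw n_samples distinct indices in [0, n_items), reproducibly."""
    # A private generator: other code calling random.* cannot disturb it.
    rng = random.Random(seed)
    return rng.sample(range(n_items), n_samples)

if __name__ == "__main__":
    # Hypothetical seed; record it next to the command that produces the figure.
    print(sample_training_subset(1000, 5, seed=1234))
```

Calling the function twice with the same seed returns the same indices, which is what re-running an experiment requires.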
Code should be transferable
Even if you have been developing your stuff alone for 1.5 years, you
should assume that your code will be transferred: if the paper is
successful, people will want to re-use your code.
There have been several instances of PhDs leaving with code that was
too complicated for followers to re-use.
Recommendations
Baseline
Please start from the good coding habits you already have (or take a look at [1,2]):
- use a versioning system (svn or git). Creating a git repository
in a directory costs nothing.
- indent your code and be consistent in naming.
- do not optimize code unless needed.
But relax, there are typical software practices
that are not so important:
- portability: all machines run 64-bit Linux. You can assume this
will last.
- uniform coding style: indent 3 or 8 spaces, nobody will care.
- documentation: there should not be much more to document than
what is written in the paper.
- helpful error messages: you can assume that the one running your
code is a developer, so assertions are ok.
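To illustrate the last point, here is a small sketch where an assertion replaces a hand-written error message. The function l1_distance is a hypothetical example, not part of any LEAR code.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    # The caller is a developer: a failed assertion plus a traceback is enough.
    assert len(a) == len(b), (len(a), len(b))
    return sum(abs(x - y) for x, y in zip(a, b))
```

On a length mismatch this raises AssertionError and shows the two lengths. Keep in mind that Python skips assert statements when run with the -O flag.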
Languages
Depending on the project, you may or may not be allowed to choose your
programming language.
If you choose a non-standard language, this will place a burden on followers,
so there should be a good reason for this.
Numerical languages
The main numerical languages used at LEAR are Matlab/Octave and
Python/numpy. Python is a much richer language and does not have
licensing problems, but Matlab is simpler.
Low-level languages
The main low-level language is C. It is interfaced with mex for
Matlab, and cython or SWIG for Python, or simply called as a
subprocess.
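The subprocess route can be sketched as follows in Python. The helper name run_tool is hypothetical, and since no compiled C program is available here, a Python one-liner stands in for the binary. Only standard subprocess calls are used.

```python
import subprocess
import sys

def run_tool(argv, input_text):
    """Run an external program, feed it input_text on stdin, return its stdout."""
    result = subprocess.run(
        argv,
        input=input_text,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str rather than bytes
        check=True,               # raise CalledProcessError on non-zero exit
    )
    return result.stdout

if __name__ == "__main__":
    # Stand-in for a compiled C binary: a one-liner that upper-cases stdin.
    stand_in = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_tool(stand_in, "1 2 3\n"))
```

This keeps the interface between the two languages down to text on stdin and stdout, at the price of process start-up time and parsing.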
Write code not headers
Some languages (read C++) require or encourage you to write a lot of code
that does not actually translate to any machine instruction. When you
end up writing a lot of get/set, public/private, virtual, namespace,
or even comments, this should ring a bell.
Write simple code
It turns out that it is difficult to write simple code, because it is
difficult/subjective to define what simple is. Here are a few
tentative guidelines.
Do not write smart code
In the time you saved, think of smart research ideas.
Write shallow code
Deep call stacks are hard to follow, especially if functions are
scattered over several files in different directories.
The guy re-using your
code has the paper in his hands, and wants to see where equation (5)
is applied to data X, not follow a 3-level call stack.
Use simple languages (and features)
C is simpler than C++, Matlab is very simple, but it is possible to
write complicated code in any language.
Languages often have shiny "advanced" features.
Here is a table with a few examples:
Language | complicated features (non-exhaustive!)
C++      | templates, operator overloading, boost, C++11
Python   | operator overloading, dynamic addition of methods to instances
Matlab   | manipulation of caller's symbol table
If you think you need
advanced features, please think again.
If you still think so, please choose the smallest subset that
you can live with.
Although portability is not a major concern, it is a good test for
code simplicity. Does your Matlab code work with Octave? Does your
Python code run on Python 2.6? Does your C++ code compile on gcc 3.x?
Avoid genericity
Do you need grayscale images with pixels other than 32-bit float? (or
images in more than 2 dimensions!) Or matrices with elements other
than double? Are you ever going to use something other than L2
normalization?
Genericity comes at a cost in terms of lost focus and code bloat, so
use it only if you really think that it will be useful. And remove it if it
turns out that it was not necessary.
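As a hedged illustration, a deliberately non-generic normalization could look like this: one norm, one container type, no p parameter, no axis argument. The function name l2_normalize and the choice of a flat list of floats are assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit L2 norm. v is a flat list of floats."""
    norm = math.sqrt(sum(x * x for x in v))
    assert norm > 0, "cannot normalize the zero vector"
    return [x / norm for x in v]
```

If another norm is ever needed for the paper, it can be added then, in the place where it is used.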
Avoid libraries
If reinventing the wheel takes 10 lines of code, please do
so. Dependencies always incur more work to understand. You can copy
code from libraries if relevant (and allowed by the license).
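One example of a wheel that fits in about 10 lines is brute-force nearest-neighbor search. The sketch below is a hypothetical illustration (the function name and the list-of-lists layout are assumptions); for a small database it avoids a dependency altogether.

```python
def nearest_neighbor(query, database):
    """Return (index, squared L2 distance) of the database vector closest to query."""
    best_i, best_d = -1, float("inf")
    for i, v in enumerate(database):
        d = sum((a - b) ** 2 for a, b in zip(query, v))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```

Whether this is fast enough depends on the database size. The claim here is only that the whole behavior is visible in one place.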
Corollary: avoid layers. For many useful libraries or programs
there are wrappers to make them "cleaner" or "easier to use"
(e.g. scikit-learn for libsvm, a C++ mex interface above mex, Boost's
interface above BLAS, Python's threading module above threads). Please
evaluate whether the wrapper adds something significant to the
original library, or whether you could write a more focused wrapper
yourself.
The F-word
Building "frameworks", "toolboxes" or "pipelines" is engineering
work. LEAR does not sell this kind of thing.
Bury dead code
Unused code is harmful.
It is unlikely that more than 10% of the code you write will be
in the execution path to a result in the paper. During development you will
test many variants, most of which fail or are not optimal. If these
variants are not worthwhile to put in the paper, they are not worthwhile
to keep in the code. Remove them.
"I keep it just in case" is not an option: use a source code
versioning system to keep track of them (svn or git). Do not "comment
out" code.
Old = trusted
At LEAR, people are doing science, not technique. There is rarely a
reason why bleeding-edge research must be built on new techniques.
Old libraries have undergone Darwinian selection. If they survived,
they are probably worth something.
Corollary: new = untrusted. New libraries or techniques should
be used very cautiously. We are not the beta-testers of the latest
machine learning package.
References
1. The Linux kernel coding style is full of useful remarks, see e.g. Section 8 about comments.
2. Google C++ Style Guide: a coding style with arguments (favorite: do not use iostream/fstream).
matthijs douze
Last modified: Thu Jan 30 11:05:39 CET 2014