[Mypaint-discuss] Surface optimizations proposed for merging
Posted by Jon Nordby on November 18, 2012 - 03:12:
I finished the last missing pieces of the surface optimization I
started a while back. The changes are not that invasive, but they could
use some real-life testing before going into master. If no issues are
found, I'd like them to be part of the MyPaint 1.1 release.
The code is in the "surface-optimizations" branch of the mainline repository.
Please test! (Check out the branch, then build and run MyPaint as normal.)
== Changes ==
The optimizations follow a three-pronged strategy:
1. Reordering of data access to minimize fetching and updating of tiles.
2. Coarse-grained parallelism using multithreading via OpenMP directives.
3. Fine-grained parallelism using SSE via GCC auto-vectorization.
The MyPaint surface API has a concept of an atomic transaction:
surface.begin_atomic() and surface.end_atomic(). Inside such a
transaction, we call brush.stroke_to(surface, ...) each time there is
a motion event on the canvas. Depending on the brush configuration and
current state, this may result in 0 to N surface.draw_dab() calls, where N
can be on the order of 10-100.
Previously each draw_dab() call would fetch the affected tiles,
process the draw_dab operation and update the tiles with the results.
When subsequent draw_dab() calls affected the same tiles, those tiles were
fetched and updated once per call instead of once per transaction, up to
N-1 more times than needed.
Now, each time draw_dab() is called, an operation struct is added to a
queue for each of the affected tiles before returning. No processing
is done at this point. When end_atomic() is called to complete the
transaction, the tiles that have pending operations are distributed
evenly among the processing threads. The processing of a tile is
completely independent of other tiles, which is what allows the work to
be split across threads.
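
The queueing scheme described above can be sketched roughly as follows. This is an illustration only, not the code from the branch: the DabOp, Tile and Surface types, the fixed 4x4 surface, the queue capacity, the single-channel pixels and the hard-edged blend are all assumptions made to keep the example short. The point is that draw_dab() only appends to per-tile queues, and end_atomic() hands the dirty tiles to an OpenMP loop.

```c
#include <assert.h>

#define TILE_SIZE   64    /* assumed tile edge, in pixels */
#define TILES_X     4     /* assumed fixed surface size, in tiles */
#define TILES_Y     4
#define N_TILES     (TILES_X * TILES_Y)
#define MAX_PENDING 128   /* assumed per-tile queue capacity */

typedef struct { float x, y, radius, opacity; } DabOp;  /* hypothetical */

typedef struct {
    DabOp pending[MAX_PENDING];           /* queued, not yet applied */
    int   n_pending;
    float pixels[TILE_SIZE * TILE_SIZE];  /* one channel, for brevity */
} Tile;

typedef struct { Tile tiles[N_TILES]; } Surface;

/* Apply every queued dab to one tile, in order, then clear its queue. */
static void process_tile(Tile *t, int tx, int ty)
{
    for (int i = 0; i < t->n_pending; i++) {
        const DabOp *op = &t->pending[i];
        const float r2 = op->radius * op->radius;
        for (int ly = 0; ly < TILE_SIZE; ly++) {
            for (int lx = 0; lx < TILE_SIZE; lx++) {
                float dx = (float)(tx * TILE_SIZE + lx) - op->x;
                float dy = (float)(ty * TILE_SIZE + ly) - op->y;
                if (dx * dx + dy * dy <= r2) {
                    float *p = &t->pixels[ly * TILE_SIZE + lx];
                    *p += op->opacity * (1.0f - *p);  /* hard-edged dab */
                }
            }
        }
    }
    t->n_pending = 0;
}

/* Map a coordinate span to a clamped tile range; returns 0 if off-surface. */
static int tile_range(float lo, float hi, int n_tiles, int *first, int *last)
{
    const float extent = (float)(n_tiles * TILE_SIZE);
    if (hi < 0.0f || lo >= extent) return 0;
    *first = lo <= 0.0f ? 0 : (int)(lo / TILE_SIZE);
    *last  = hi >= extent ? n_tiles - 1 : (int)(hi / TILE_SIZE);
    return 1;
}

/* Queue the dab on each tile its bounding box touches; no pixel is written.
 * Returns the number of tiles the dab was queued on. */
int draw_dab(Surface *s, float x, float y, float radius, float opacity)
{
    int tx0, tx1, ty0, ty1, queued = 0;
    assert(radius > 0.0f);
    if (!tile_range(x - radius, x + radius, TILES_X, &tx0, &tx1)) return 0;
    if (!tile_range(y - radius, y + radius, TILES_Y, &ty0, &ty1)) return 0;
    for (int ty = ty0; ty <= ty1; ty++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            Tile *t = &s->tiles[ty * TILES_X + tx];
            if (t->n_pending == MAX_PENDING)
                process_tile(t, tx, ty);  /* queue full: apply early */
            t->pending[t->n_pending++] = (DabOp){ x, y, radius, opacity };
            queued++;
        }
    }
    return queued;
}

/* Complete the transaction: collect dirty tiles, then process in parallel. */
void end_atomic(Surface *s)
{
    int dirty[N_TILES], n_dirty = 0;
    for (int i = 0; i < N_TILES; i++)
        if (s->tiles[i].n_pending > 0) dirty[n_dirty++] = i;

    #pragma omp parallel for schedule(static)
    for (int k = 0; k < n_dirty; k++) {
        int i = dirty[k];
        process_tile(&s->tiles[i], i % TILES_X, i / TILES_X);
    }
}
```

Collecting the dirty tiles into a list first means the static schedule divides only tiles with actual work. Without OpenMP enabled the pragma is ignored and the loop runs serially with the same result, since no two iterations touch the same tile.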
When a get_color() request is made by the brush engine during a
surface transaction, the pending draw_dab operations on the affected
tiles must be flushed to return the correct value. Both the flushing
and the calculation of the color are done multi-threaded, in the same way
as the tile processing described above.
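
A minimal sketch of the flush-before-read idea, under simplifying assumptions: tiles are tiny, a queued operation is reduced to a single opacity applied to the whole tile, and the "color" is a plain average over a range of tiles. None of these names or shapes come from the branch. It shows only that get_color() applies pending operations on the tiles it reads, leaves other tiles queued, and sums with an OpenMP reduction.

```c
#define TILE_PIXELS 16   /* tiny tiles keep the sketch readable (assumed) */
#define MAX_PENDING 32   /* assumed per-tile queue capacity */

typedef struct {
    float pending[MAX_PENDING];  /* queued dab opacities (hypothetical) */
    int   n_pending;
    float pixels[TILE_PIXELS];
} Tile;

/* Apply and clear every queued operation on one tile. */
static void flush_tile(Tile *t)
{
    for (int i = 0; i < t->n_pending; i++)
        for (int p = 0; p < TILE_PIXELS; p++)
            t->pixels[p] += t->pending[i] * (1.0f - t->pixels[p]);
    t->n_pending = 0;
}

/* Average pixel value over tiles first..last (inclusive).
 * Only the tiles being read are flushed; others keep their queues. */
float get_color(Tile *tiles, int first, int last)
{
    float sum = 0.0f;

    #pragma omp parallel for reduction(+:sum)
    for (int i = first; i <= last; i++) {
        flush_tile(&tiles[i]);
        for (int p = 0; p < TILE_PIXELS; p++)
            sum += tiles[i].pixels[p];
    }
    return sum / (float)((last - first + 1) * TILE_PIXELS);
}
```

Each loop iteration flushes and reads a different tile, so the only shared state is the running sum, which the reduction clause handles.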
Within each thread, SSE based vectorization is used to process a tile.
Currently this is limited to part of the brush mask calculation, as
the run-length encoding of the masks makes it difficult to
auto-vectorize all of the mask calculation.
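
To illustrate what an auto-vectorization-friendly loop looks like, here is a sketch of a per-row mask computation. The function names, the linear falloff and the split into two passes are assumptions for the example, not MyPaint's actual mask code. Each loop has a fixed trip count, no data-dependent branches and non-aliasing (restrict) pointers, which is the shape GCC's vectorizer (enabled at -O3, or with -ftree-vectorize) generally handles well.

```c
/* Normalized squared distance from the dab center for one row of pixels.
 * rr[i] == 1.0 lies exactly on the dab's rim. */
void mask_row_rr(float *restrict rr, int n,
                 float x0, float y, float cx, float cy, float radius)
{
    const float inv_r2 = 1.0f / (radius * radius);
    const float dy = y - cy;
    for (int i = 0; i < n; i++) {
        float dx = (x0 + (float)i) - cx;
        rr[i] = (dx * dx + dy * dy) * inv_r2;
    }
}

/* Linear falloff: 1 at the center, 0 at the rim and beyond.
 * The clamp is a select rather than a branch, so it maps to a SIMD max. */
void mask_row_opacity(float *restrict out, const float *restrict rr, int n)
{
    for (int i = 0; i < n; i++) {
        float o = 1.0f - rr[i];
        out[i] = o > 0.0f ? o : 0.0f;
    }
}
```

A run-length encoded mask, by contrast, has to interleave skip counts with data, which introduces the kind of data-dependent control flow that defeats this pattern.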
== Results ==
These results are from my laptop, running Arch Linux current. CPU:
Dual-core Intel i5 M520@xxx GHz, 6GB RAM
Note: this benchmarks the *raw* surface rendering performance. The
user *may* experience speed-ups similar to what is shown here, but
only if layer compositing and rendering to screen are not the
bottleneck.
* 20% to 50% performance improvements for larger brushes (16 px+) on
the currently used Python-based backend.
* Performance does not regress significantly for small brushes; the
largest slowdown found was 4%.
* After the changes, GEGL-based backend is circa 30% faster than the
Python-based backend with 1 thread, and twice as fast with 2 threads.
A quad-core CPU with 4 threads will have an even higher speedup.
scons enable_gegl=true enable_openmp=true # to enable GEGL backend, requires babl+gegl git
../../lib/test-python-surface # current python-based backend
./test-gegl-surface # GEGL backend
Look inside mypaint-test-surface.c to see/change the different test cases.
== Future ==
Given that the GEGL backend has a significantly higher raw
performance, I hope that after we release MyPaint 1.1 we can start the
transition to use it instead of our current backend.
I have some more ideas to further improve performance, and am working
to document these now.
Jon Nordby - www.jonnor.com