[Mypaint-discuss] Surface optimizations proposed for merging

Posted by Jon Nordby on November 18, 2012 - 03:12:
Hi all,
I finished the last missing pieces of the surface optimization I
started a while back. The changes are not that invasive, but they could
use some real-life testing before going into master. If no issues are
found, I'd like them to be part of the MyPaint 1.1 release.

The code is in the "surface-optimizations" branch of the mainline repository.
Please test! (Check out the branch, then build and run MyPaint as normal.)

== Changes ==
The optimizations follow a three-pronged strategy:

1. Reordering of data access to minimize fetching and updating of tiles.
2. Coarse-grained parallelism using multithreading via OpenMP directives.
3. Fine-grained parallelism using SSE via GCC auto-vectorization.

The MyPaint surface API has a concept of an atomic transaction:
surface.begin_atomic() and surface.end_atomic(). Inside such a
transaction, we call brush.stroke_to(surface, ...) each time there is
a motion event on the canvas. Depending on the brush configuration and
current state, this may result in 0 to N surface.draw_dab() calls, where
N can be on the order of 10-100.
Previously, each draw_dab() call would fetch the affected tiles,
process the draw_dab operation and update the tiles with the results.
When subsequent draw_dab() calls affected the same tiles, those tiles
were fetched and updated up to N times where once would have sufficed.
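
To make the cost of the old behaviour concrete, here is a rough Python
sketch (not the actual brushlib code). The class ImmediateSurface, the
helper tiles_touched() and the 64 px tile size are assumptions made up
for this illustration; the only thing it models is that every
draw_dab() call does one fetch/update round trip per affected tile.

```python
from collections import Counter

TILE_SIZE = 64  # assumed tile edge length in pixels, for illustration only


def tiles_touched(x, y, radius, tile_size=TILE_SIZE):
    """Return the (tx, ty) indices of tiles overlapped by a dab's bounding box."""
    tx0 = int((x - radius) // tile_size)
    tx1 = int((x + radius) // tile_size)
    ty0 = int((y - radius) // tile_size)
    ty1 = int((y + radius) // tile_size)
    return {(tx, ty) for tx in range(tx0, tx1 + 1) for ty in range(ty0, ty1 + 1)}


class ImmediateSurface:
    """Hypothetical surface that fetches and updates tiles on every draw_dab()."""

    def __init__(self):
        self.fetches = Counter()  # per-tile count of fetch/update round trips

    def draw_dab(self, x, y, radius):
        for tile in tiles_touched(x, y, radius):
            # Stands in for: fetch tile, render the dab into it, write it back.
            self.fetches[tile] += 1


surface = ImmediateSurface()
n_dabs = 50
for i in range(n_dabs):
    # Closely spaced dabs, as a single stroke_to() call might produce.
    surface.draw_dab(100 + 0.2 * i, 100, radius=8)

print(dict(surface.fetches))
```

All 50 dabs in this example land on the same tile, so that tile goes
through 50 fetch/update round trips even though one would be enough.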

Now, each time draw_dab() is called, an operation struct is added to a
queue for each of the affected tiles before returning. No processing
is done at this point. When end_atomic() is called to complete the
transaction, the tiles that have pending operations are distributed
evenly among the processing threads. The processing of a tile is
completely independent of other tiles, allowing it to be done in a
lock-free manner.
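
The queue-then-process scheme can be sketched roughly as follows. This
is an illustration in Python under stated assumptions, not the real
implementation: the branch uses OpenMP threads in C, while this sketch
uses a ThreadPoolExecutor only to show the structure (Python threads
will not give the same speedup). QueuedSurface, tiles_touched(), the
tile size and the "add up opacities" stand-in for dab rendering are all
hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

TILE_SIZE = 64  # assumed tile edge length in pixels, for illustration only


def tiles_touched(x, y, radius, tile_size=TILE_SIZE):
    """Return the (tx, ty) indices of tiles overlapped by a dab's bounding box."""
    tx0 = int((x - radius) // tile_size)
    tx1 = int((x + radius) // tile_size)
    ty0 = int((y - radius) // tile_size)
    ty1 = int((y + radius) // tile_size)
    return {(tx, ty) for tx in range(tx0, tx1 + 1) for ty in range(ty0, ty1 + 1)}


class QueuedSurface:
    """Hypothetical surface that defers all dab rendering to end_atomic()."""

    def __init__(self, num_threads=2):
        self.num_threads = num_threads
        self.tiles = {}                    # (tx, ty) -> value standing in for pixel data
        self.pending = defaultdict(list)   # (tx, ty) -> queued dab operations
        self.fetches = 0                   # total tile fetch/update round trips

    def begin_atomic(self):
        self.pending.clear()

    def draw_dab(self, x, y, radius, opacity):
        op = (x, y, radius, opacity)
        for tile in tiles_touched(x, y, radius):
            self.pending[tile].append(op)  # only queue; no rendering here

    def _process_tile(self, tile, ops):
        value = self.tiles.get(tile, 0.0)  # one fetch per tile
        for _x, _y, _radius, opacity in ops:
            value += opacity               # placeholder for real dab rendering
        return tile, value

    def end_atomic(self):
        # Each tile is handled by exactly one worker and touches no shared
        # state, so the workers need no locks.
        with ThreadPoolExecutor(max_workers=self.num_threads) as pool:
            results = list(pool.map(lambda item: self._process_tile(*item),
                                    self.pending.items()))
        for tile, value in results:
            self.tiles[tile] = value       # one update per tile
        self.fetches += len(results)
        self.pending.clear()


surface = QueuedSurface(num_threads=2)
surface.begin_atomic()
for i in range(50):
    surface.draw_dab(100 + 0.2 * i, 100, radius=8, opacity=0.5)
surface.end_atomic()

print(surface.fetches, surface.tiles)
```

With the same 50 closely spaced dabs, the tile is now fetched and
updated once per transaction instead of once per dab.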

When a get_color() request is made by the brush engine during a
surface transaction, the pending draw_dab operations on the affected
tiles must be flushed first so that the correct value is returned. Both
the flushing and the calculation of the color are done multi-threaded, in
the same way as the tile processing in end_atomic().
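
A small sketch of the flush-on-read rule, again in Python and again
only an illustration: FlushOnReadSurface and its simplified signatures
(tiles passed directly, a single float per tile) are invented for this
example, and it is single-threaded for brevity. The point it shows is
that get_color() only needs to flush the tiles it reads; queues for
other tiles can stay pending until end_atomic().

```python
from collections import defaultdict


class FlushOnReadSurface:
    """Hypothetical surface with deferred dabs and flush-before-read semantics."""

    def __init__(self):
        self.tiles = defaultdict(float)   # tile -> value standing in for pixel data
        self.pending = defaultdict(list)  # tile -> queued dab opacities

    def draw_dab(self, tile, opacity):
        self.pending[tile].append(opacity)  # queue only, no rendering

    def _flush(self, tiles):
        for tile in tiles:
            # pop() removes the queue so each operation is applied exactly once
            for opacity in self.pending.pop(tile, []):
                self.tiles[tile] += opacity  # placeholder for real rendering

    def get_color(self, tiles):
        self._flush(tiles)  # pending dabs must land before we read
        values = [self.tiles[t] for t in tiles]
        return sum(values) / len(values)

    def end_atomic(self):
        self._flush(list(self.pending))  # copy keys: _flush mutates the dict


surface = FlushOnReadSurface()
surface.draw_dab((0, 0), 0.25)
surface.draw_dab((0, 0), 0.25)
surface.draw_dab((1, 0), 0.5)

color = surface.get_color([(0, 0)])       # flushes only tile (0, 0)
still_pending = sorted(surface.pending)   # tile (1, 0) is still queued
surface.end_atomic()

print(color, still_pending, dict(surface.tiles))
```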

Within each thread, SSE-based vectorization is used to process a tile.
Currently this is limited to part of the brush mask calculation, as
the run-length encoding of the masks makes it difficult to
auto-vectorize all of the mask calculation and the

== Results ==
These results are from my laptop, running current Arch Linux. CPU:
Dual-core Intel i5 M520@xxx GHz, 6GB RAM

Note: this benchmarks the *raw* surface rendering performance. The
user *may* experience speed-ups similar to what is shown here, but
only if layer compositing and rendering to screen are not a bottleneck.

* 20% to 50% performance improvement for larger brushes (16 px+) on
the currently used Python-based backend.
* Performance does not regress significantly for small brushes; the
largest slowdown found was 4%.
* After the changes, the GEGL-based backend is circa 30% faster than the
Python-based backend with 1 thread, and twice as fast with 2 threads.
A quad-core CPU with 4 threads should see an even higher speedup.

To reproduce:
  # Enable the GEGL backend (requires babl+gegl from git)
  scons enable_gegl=true enable_openmp=true
  cd brushlib/tests
  export PYTHONPATH=../../lib:../..
  export LD_LIBRARY_PATH=../..
  export GEGL_SWAP=RAM
  export OMP_NUM_THREADS=2
  ../../lib/test-python-surface # current python-based backend
  ./test-gegl-surface # GEGL backend

Look inside mypaint-test-surface.c to see/change the different test cases.

== Future ==
Given that the GEGL backend has a significantly higher raw
performance, I hope that after we release MyPaint 1.1 we can start the
transition to use it instead of our current backend.

I have some more ideas for further improving performance, and am working
to document these now.

Jon Nordby -
