And maybe it makes sense to tune the garbage collection a bit...

Sumo 17.01.2021 10:03 / 17.01.2021 10:14

We could enable incremental collection; that usually improves performance.
/*
 * Note that this defines a large number of tuning hooks, which can
 * safely be ignored in nearly all cases.  For normal use it suffices
 * to call only GC_MALLOC and perhaps GC_REALLOC.
 * For better performance, also look at GC_MALLOC_ATOMIC, and
 * GC_enable_incremental.  If you need an action to be performed
 * immediately before an object is collected, look at GC_register_finalizer.
 * If you are using Solaris threads, look at the end of this file.
 * Everything else is best ignored unless you encounter performance
 * problems.
 */

/* Enable incremental/generational collection.  Not advisable unless    */
/* dirty bits are available or most heap objects are pointer-free       */
/* (atomic) or immutable.  Don't use in leak finding mode.  Ignored if  */
/* GC_dont_gc is non-zero.  Only the generational piece of this is      */
/* functional if GC_parallel is non-zero or if GC_time_limit is         */
/* GC_TIME_UNLIMITED.  Causes thread-local variant of GC_gcj_malloc()   */
/* to revert to locked allocation.  Must be called before any such      */
/* GC_gcj_malloc() calls.  For best performance, should be called as    */
/* early as possible.  On some platforms, calling it later may have     */
/* adverse effects.                                                     */
/* Safe to call before GC_INIT().  Includes a  GC_init() call.          */
GC_API void GC_CALL GC_enable_incremental(void);
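For reference, roughly how I'd wire it up (a minimal untested sketch based on the header comment above; the only bdwgc calls assumed here are GC_enable_incremental, GC_INIT, GC_MALLOC, GC_MALLOC_ATOMIC and GC_get_heap_size, all declared in gc.h):
#include <stdio.h>
#include <gc.h>

int main(void)
{
    /* Enable incremental/generational collection as early as possible.     */
    /* Per the header comment it is safe to call before GC_INIT() and       */
    /* already includes a GC_init() call.                                    */
    GC_enable_incremental();
    GC_INIT();

    for (int i = 0; i < 1000000; ++i) {
        /* Ordinary collectable allocation.                                  */
        int **p = GC_MALLOC(sizeof *p);
        /* Pointer-free payload: cheaper for the marker to scan.             */
        *p = GC_MALLOC_ATOMIC(64 * sizeof(int));
    }
    printf("heap size: %lu\n", (unsigned long)GC_get_heap_size());
    return 0;
}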
Or we could even enable parallel marking, but that already feels like overkill, and the status of this feature in the library is not entirely clear:
#ifdef GC_THREADS
  GC_API GC_ATTR_DEPRECATED int GC_parallel;
                        /* GC is parallelized for performance on        */
                        /* multiprocessors.  Currently set only         */
                        /* implicitly if collector is built with        */
                        /* PARALLEL_MARK defined and if either:         */
                        /*  Env variable GC_NPROC is set to > 1, or     */
                        /*  GC_NPROC is not set and this is an MP.      */
                        /* If GC_parallel is on (non-zero), incremental */
                        /* collection is only partially functional,     */
                        /* and may not be desirable.  The getter does   */
                        /* not use or need synchronization (i.e.        */
                        /* acquiring the GC lock).  Starting from       */
                        /* GC v7.3, GC_parallel value is equal to the   */
                        /* number of marker threads minus one (i.e.     */
                        /* number of existing parallel marker threads   */
                        /* excluding the initiating one).               */
  GC_API int GC_CALL GC_get_parallel(void);
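So at runtime we can at least check whether parallel marking actually kicked in, something like this (a sketch; GC_get_parallel is the getter declared right above, the rest is plain stdio):
#define GC_THREADS   /* must be defined before including gc.h in a threaded build */
#include <stdio.h>
#include <gc.h>

int main(void)
{
    GC_INIT();
    /* Per the comment above: since GC v7.3 this is the number of parallel   */
    /* marker threads excluding the initiating one; 0 means no parallel mark. */
    int extra_markers = GC_get_parallel();
    if (extra_markers > 0)
        printf("parallel marking: %d extra marker thread(s)\n", extra_markers);
    else
        printf("parallel marking is off\n");
    return 0;
}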
The docs say the following:
## Performance

We conducted some simple experiments with a version of
[our GC benchmark](http://www.hboehm.info/gc/gc_bench/) that was slightly
modified to run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they
share a heap. This benchmark involves very little work outside of memory
allocation. This was run with GC 6.0alpha3 on a dual processor Pentium III/500
machine under Linux 2.2.12.

Running with a thread-unsafe collector, the benchmark ran in 9 seconds. With
the simple thread-safe collector, built with `-DGC_THREADS`, the execution
time increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. (The
times for the `malloc`/`free` version with glibc `malloc` are 10.51 (standard
library, pthreads not linked), 20.90 (one thread, pthreads linked), and 24.55
seconds respectively. The benchmark favors a garbage collector, since most
objects are small.)
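As I understand the setup, each client thread looks roughly like this (a schematic sketch, not the benchmark code itself; it assumes GC_THREADS is defined before including gc.h so that, at least on Linux builds with thread support, pthread_create is redirected to the GC-aware wrapper):
#define GC_THREADS              /* redirects pthread_create & co. via gc.h */
#include <pthread.h>
#include <stdio.h>
#include <gc.h>

#define N_ALLOCS 1000000

/* Each client thread just allocates from the shared GC heap, */
/* mimicking the benchmark described above.                    */
static void *client(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ALLOCS; ++i) {
        void **node = GC_MALLOC(2 * sizeof(void *));
        node[0] = GC_MALLOC_ATOMIC(32);   /* pointer-free payload */
    }
    return NULL;
}

int main(void)
{
    GC_INIT();
    pthread_t t1, t2;
    pthread_create(&t1, NULL, client, NULL);
    pthread_create(&t2, NULL, client, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("collections: %lu\n", (unsigned long)GC_get_gc_no());
    return 0;
}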

The following table gives execution times for the collector built with
parallel marking and thread-local allocation support
(`-DGC_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC`). We tested the client
using either one or two marker threads, and running one or two client threads.
Note that the client uses thread local allocation exclusively. With
`-DTHREAD_LOCAL_ALLOC` the collector switches to a locking strategy that
is better tuned to less frequent lock acquisition. The standard allocation
primitives thus perform slightly worse than without `-DTHREAD_LOCAL_ALLOC`,
and should be avoided in time-critical code.

(The results using `pthread_mutex_lock` directly for allocation locking would
have been worse still, at least for older versions of linuxthreads. With
`-DTHREAD_LOCAL_ALLOC`, we first repeatedly try to acquire the lock with
`pthread_mutex_trylock`, busy-waiting between attempts. After a fixed number
of attempts, we use `pthread_mutex_lock`.)
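Just to illustrate the strategy they describe, here is my own schematic version of the "try a few times, then block" pattern (not the collector's actual code; SPIN_ATTEMPTS is an arbitrary number I picked):
#include <pthread.h>
#include <sched.h>

#define SPIN_ATTEMPTS 10   /* arbitrary; the real limit in the collector may differ */

/* Spin-then-block acquisition of an allocation lock, as described above. */
static void alloc_lock_acquire(pthread_mutex_t *lock)
{
    for (int i = 0; i < SPIN_ATTEMPTS; ++i) {
        if (pthread_mutex_trylock(lock) == 0)
            return;                 /* got the lock without blocking */
        sched_yield();              /* back off briefly between attempts */
    }
    pthread_mutex_lock(lock);       /* fall back to a blocking acquire */
}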

These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark. All
measurements are in elapsed seconds on an unloaded machine.

Number of threads | 1 marker thread (secs.) | 2 marker threads (secs.)
---|---|---
1 client | 10.45 | 7.85
2 clients | 19.95 | 12.3

The execution time for the single-threaded case is slightly worse than with
simple locking. However, the single-threaded benchmark runs faster than even
the thread-unsafe version if a second processor is available. The execution
time for two clients with thread-local allocation is only 1.4 times the
sequential execution time for a single thread in a thread-unsafe
environment, even though it involves twice the client work. That represents
close to a factor of 2 improvement over the 2 client case with the old
collector. The old collector clearly still suffered from some contention
overhead, in spite of the fact that the locking scheme had been fairly well
tuned.

Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors) is probably not achievable on this
kind of hardware even with such a small number of processors, since the memory
system is a major constraint for the garbage collector, the processors usually
share a single memory bus, and thus the aggregate memory bandwidth does not
increase in proportion to the number of processors.

These results are likely to be very sensitive to both hardware and OS issues.
Preliminary experiments with an older Pentium Pro machine running an older
kernel were far less encouraging.