Verifying results across architectures, for SPEC benchmarking

@geuzaine Thanks for being willing to look at some reproducibility issues. The big challenge in crafting benchmarks is making sure they produce the same result across architectures and compiler optimization levels. I'm attaching the output from one of the example files, bubble_interphase_layers.geo. I am using this as the "test" input, and since it is quick to run I hope it is easy to debug.

gmsh_verification.tar.gz

We started with this snapshot of gmsh and put it into the SPEC harness: Gmsh 4.11.0-git-5aea61f4c

The "aarch64" and "x86-64" directories in the tarball contain the output from running the command line below.

  $ gmsh -option gmsh.opts -smooth 0 -2 bubble_interphase_layers.geo > bubble_interphase_layers.geo.out

For now I am building inside the SPEC harness with gcc-12.1 at the -O0 optimization level. We make sure to build all the files with...

  -fno-unsafe-math-optimizations -ffp-contract=off

...since numeric/robustPredicates.cpp requires those restrictions. SPEC run rules indicate that all source files must be compiled with the same optimization flags, so the flags are applied to every file and not just that one.

I copied all the relevant output files and saved them in each directory inside the tarball. I took the final .msh generated and plotted it using the visualization tool you offer for download (gmsh.exe binary with CAD for Windows), and that is the .png file in each directory. It is clear that the work was accomplished, albeit with some small differences.

I then downloaded the latest mainline, built it on aarch64, and ran it again on the same input. This produced a third, slightly different result, which is in the "aarch64-mainline" directory. (This confused me even more.)

I would love help isolating the cause of the differences. Alternatively, if you have another idea for how to verify across many architectures that the work was completed satisfactorily, we are open to it. For example, I tried an image comparison on the YUV files below, which told me they are 90% similar, pixel to pixel. Good enough, perhaps?

For example, can you spot the 100000 differences between these two images? 😇
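
To make the "90% similar, pixel to pixel" idea concrete, here is a rough Python sketch of that kind of check. It is not the tool I actually ran. The function names `pixel_similarity` and `frames_match`, the `tol` parameter, and the 0.9 threshold are all hypothetical, and it treats every byte of a raw YUV dump as one sample, which only approximates a true per-pixel comparison.

```python
def pixel_similarity(a: bytes, b: bytes, tol: int = 0) -> float:
    """Return the fraction of byte samples in a and b that differ by at most tol.

    Assumption: both buffers are raw 8-bit dumps of frames with the same
    resolution and layout, so byte i in a corresponds to byte i in b.
    """
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    if not a:
        return 1.0  # two empty frames are trivially identical
    same = sum(1 for x, y in zip(a, b) if abs(x - y) <= tol)
    return same / len(a)


def frames_match(a: bytes, b: bytes, threshold: float = 0.9, tol: int = 0) -> bool:
    """Hypothetical pass/fail rule: accept when enough samples agree."""
    return pixel_similarity(a, b, tol) >= threshold


# Tiny synthetic example: ten samples, one of which differs by 2.
reference = bytes([10] * 10)
candidate = bytes([10] * 9 + [12])
print(pixel_similarity(reference, candidate))         # exact match required
print(pixel_similarity(reference, candidate, tol=2))  # small deviations allowed
```

For real files the two buffers would come from something like `open(path, "rb").read()`. The open question for a benchmark is whether any fixed threshold is meaningful, since a mesh that is shifted slightly can change many pixels without being a worse result.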