Hi,
Calibre builds are not reproducible: various PNG icons differ slightly between builds [1]. I know where the issue occurs, but I'm at a loss as to where the fix should be applied. We have Qt experts and enthusiasts in Fedora, so I thought I'd post here…
Icons are rendered [2] using the following code:
    img = QImage(path).scaled(int(width), int(height), Qt.AspectRatioMode.IgnoreAspectRatio, Qt.TransformationMode.SmoothTransformation)
    img.save(dest)
With python3-pyqt6-6.8.0-0.1.fc42.x86_64, we get a difference in how the icons are rendered:
calibre-7.20.0-1.fc42.x86_64
  modified-S.5........ /usr/share/icons/hicolor/16x16/apps/calibre-gui.png
  modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-ebook-edit.png
  modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-gui.png
  modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-viewer.png
  ...
There are some tiny differences in shading of some pixels. The difference is not discernible visually for me. [1] has example icons attached.
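One rough way to quantify such differences is sketched below. It assumes both icons have already been decoded into raw 8-bit RGBA buffers of equal size (the decoding step is not shown), and `diff_stats` is a hypothetical helper name, not something from calibre or Qt.

```python
def diff_stats(a: bytes, b: bytes, channels: int = 4) -> tuple[int, int]:
    """Return (number of differing pixels, largest per-channel delta)."""
    if len(a) != len(b) or len(a) % channels:
        raise ValueError("buffers must have equal length and whole pixels")
    differing = 0
    max_delta = 0
    for i in range(0, len(a), channels):
        pa, pb = a[i:i + channels], b[i:i + channels]
        if pa != pb:
            differing += 1
            # Iterating over bytes yields ints, so the subtraction is exact.
            max_delta = max(max_delta, max(abs(x - y) for x, y in zip(pa, pb)))
    return differing, max_delta


# Two made-up 2-pixel RGBA buffers; the second pixel differs by 1 in green.
build_a = bytes([10, 20, 30, 255, 100, 150, 200, 255])
build_b = bytes([10, 20, 30, 255, 100, 151, 200, 255])
print(diff_stats(build_a, build_b))  # (1, 1)
```

A small maximum delta (1 or 2) on a handful of pixels would be consistent with a rounding-style difference rather than memory corruption.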
Is this a bug in Qt's implementation of QImage.scaled [3]?
[1] https://pagure.io/fedora-reproducible-builds/project/issue/20
[2] https://github.com/kovidgoyal/calibre/blob/3855552c193dceb8a75dbe1a29fbc40f0...
[3] https://doc.qt.io/qt-6/qimage.html#scaled-1
Zbyszek
Zbigniew Jędrzejewski-Szmek wrote:
> With python3-pyqt6-6.8.0-0.1.fc42.x86_64, we get a difference in how the icons are rendered:
> calibre-7.20.0-1.fc42.x86_64
>   modified-S.5........ /usr/share/icons/hicolor/16x16/apps/calibre-gui.png
>   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-ebook-edit.png
>   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-gui.png
>   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-viewer.png
>   ...
> There are some tiny differences in shading of some pixels. The difference is not discernible visually for me. [1] has example icons attached.
> Is this a bug in Qt's implementation of QImage.scaled [3]?
As I understand the Qt source code, QImage.scaled with the Qt.TransformationMode.SmoothTransformation flag ends up calling QImage.smoothScaled (QImage.scaled calls the general QImage.transformed, which then detects the special case and calls QImage.smoothScaled), which in turn calls the private qSmoothScaleImage. That function uses a different algorithm depending on whether the CPU is detected at runtime to support SSE 4.1 or not. (For non-x86, there are also optimized implementations for ARM NEON and Loongson LSX, also with runtime detection; otherwise the generic C implementation is used, as on pre-SSE-4.1 x86.) See https://code.qt.io/cgit/qt/qtbase.git/tree/src/gui/painting/qimagescale.cpp and https://code.qt.io/cgit/qt/qtbase.git/tree/src/gui/painting/qimagescale_sse4... . It is likely that the vectorized implementation rounds slightly differently, so you end up with different results when building on non-identical builder hardware.
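To illustrate the mechanism described above (this is a toy sketch, not Qt's code): two kernels that differ only in how they round, selected at runtime from a set of CPU feature flags. The function names and the flag-set argument are hypothetical.

```python
def average_truncating(a: int, b: int) -> int:
    # Stand-in for a generic path: (a + b) / 2, rounded down.
    return (a + b) >> 1


def average_rounding(a: int, b: int) -> int:
    # Stand-in for an optimized path: (a + b) / 2, rounded half up.
    return (a + b + 1) >> 1


def pick_kernel(cpu_flags: set[str]):
    # Hypothetical runtime dispatch on a detected CPU feature.
    return average_rounding if "sse4_1" in cpu_flags else average_truncating


old_cpu = pick_kernel({"sse2"})
new_cpu = pick_kernel({"sse2", "sse4_1"})
print(old_cpu(100, 101), new_cpu(100, 101))  # 100 101
```

If the selected kernels differ in this way, the output depends on the build host's CPU features, and the difference only shows up for inputs where the rounding actually matters.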
Kevin Kofler
On Sun, Nov 03, 2024 at 04:08:38AM +0100, Kevin Kofler via devel wrote:
> Zbigniew Jędrzejewski-Szmek wrote:
> > With python3-pyqt6-6.8.0-0.1.fc42.x86_64, we get a difference in how the icons are rendered:
> > calibre-7.20.0-1.fc42.x86_64
> >   modified-S.5........ /usr/share/icons/hicolor/16x16/apps/calibre-gui.png
> >   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-ebook-edit.png
> >   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-gui.png
> >   modified-S.5........ /usr/share/icons/hicolor/32x32/apps/calibre-viewer.png
> >   ...
> > There are some tiny differences in shading of some pixels. The difference is not discernible visually for me. [1] has example icons attached.
> > Is this a bug in Qt's implementation of QImage.scaled [3]?
> As I understand the Qt source code, QImage.scaled with the Qt.TransformationMode.SmoothTransformation flag ends up calling QImage.smoothScaled (QImage.scaled calls the general QImage.transformed, which then detects the special case and calls QImage.smoothScaled), which in turn calls the private qSmoothScaleImage. That function uses a different algorithm depending on whether the CPU is detected at runtime to support SSE 4.1 or not. (For non-x86, there are also optimized implementations for ARM NEON and Loongson LSX, also with runtime detection; otherwise the generic C implementation is used, as on pre-SSE-4.1 x86.) See https://code.qt.io/cgit/qt/qtbase.git/tree/src/gui/painting/qimagescale.cpp and https://code.qt.io/cgit/qt/qtbase.git/tree/src/gui/painting/qimagescale_sse4... . It is likely that the vectorized implementation rounds slightly differently, so you end up with different results when building on non-identical builder hardware.
Wow, thank you, that is a great find.
The koji build used GenuineIntel Intel Xeon Processor (Cascadelake), while my rebuilder used AuthenticAMD AMD EPYC 9R14. They both have SSE 4.1 (1,2), so theoretically qt_qimageScaleAARGBA_down_x_up_y_sse4() would be used in both cases. But those are significantly different CPUs, so it seems possible that the difference is caused by the optimized vector implementations. I'm not sure though: could the exact same code deliver non-bit-identical results on different CPUs when processing 128-bit ints?
(1) fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat vnmi umip pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
(2) fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
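Comparing two such flag lists by eye is error-prone; a small sketch like the following can do it. `compare_flags` is a hypothetical helper, and the two strings are shortened excerpts of the lists above, not the full sets.

```python
def compare_flags(flags_a: str, flags_b: str) -> tuple[set[str], set[str], set[str]]:
    """Return (common flags, flags only in A, flags only in B)."""
    a, b = set(flags_a.split()), set(flags_b.split())
    return a & b, a - b, b - a


# Shortened excerpts of lists (1) and (2); paste the full lists for a real check.
intel = "sse sse2 ssse3 sse4_1 sse4_2 avx avx2 vmx erms"
amd = "sse sse2 ssse3 sse4_1 sse4_2 avx avx2 sse4a sha_ni"

common, only_intel, only_amd = compare_flags(intel, amd)
print("sse4_1" in common)                      # True
print(sorted(only_intel), sorted(only_amd))    # ['erms', 'vmx'] ['sha_ni', 'sse4a']
```

Any flag that appears on only one side is a candidate for a code path taken on one builder but not the other, although only flags that the library actually checks at runtime matter.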
Zbyszek
Kevin’s observation about floating-point rounding and runtime dispatch is an excellent one in general.
Those two CPUs should, as far as I can tell, be dispatched to the same SIMD implementations in this case.
Skimming https://github.com/qt/qtbase/blob/v6.8.0/src/gui/painting/qimagescale_sse4.c..., it looks like a fixed-point implementation that entirely avoids floating-point operations. If there are no bugs, and if I’m not missing something, it should be possible to get identical results regardless of ISA extensions since no rounding is involved.
The fact that the scaling algorithm appears to be integer-based also makes the following sources of irreproducibility less likely, but maybe not impossible:
- Some algorithms compute “left-over” leading and/or trailing data with a scalar algorithm, and in some cases this could make the results depend on alignment of buffers in memory. Besides the fact that this is an integer implementation, at a glance, Qt doesn’t appear to be doing this. It looks like QImage must be aligned and (over-)allocated to allow everything to be done in SIMD, processing some extra pixels outside the image as necessary to make complete vectors.
- SIMD algorithms might operate on input values and combine pixels in a different order than scalar ones, which could result in different rounding for floating-point operations. That shouldn’t matter for an integer algorithm like this, except maybe in cases of wrapping/overflow – which might perhaps be in play here.
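The distinction drawn in the list above can be sketched in a few lines. The values and weights are made up for illustration and are not taken from Qt; the two `blend_*` helpers are hypothetical. Integer sums do not depend on accumulation order, floating-point sums can, and an integer fixed-point computation can still differ by one when the truncating shift is applied at a different point.

```python
# Integer sums are exact, so the order of accumulation cannot matter.
ints = [65535, 1, 40000, 255, 12345]
assert sum(ints) == sum(reversed(ints))

# Floating-point addition is not associative, so the order can matter.
forward = (1e16 + 1.0) + -1e16      # 1.0 is absorbed when added to 1e16
reordered = (1e16 + -1e16) + 1.0
print(forward, reordered)           # 0.0 1.0


def blend_shift_late(a: int, b: int, wa: int, wb: int) -> int:
    # 8.8 fixed point: accumulate at full precision, truncate once.
    return (a * wa + b * wb) >> 8


def blend_shift_early(a: int, b: int, wa: int, wb: int) -> int:
    # Same weights, but each product is truncated before the sum.
    return ((a * wa) >> 8) + ((b * wb) >> 8)


print(blend_shift_late(101, 103, 128, 128),
      blend_shift_early(101, 103, 128, 128))   # 102 101
```

So an off-by-one in a few pixels does not by itself prove that floating-point code is involved; two integer code paths that truncate at different points would produce the same symptom.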
Another relevant fact is that the implementation is multi-threaded using a thread pool. If there is anything that depends on the order in which pixels/blocks are computed and combined, this could also result in different outputs, even in different runs on the same machine, and especially on machines with different numbers of cores.
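As a sanity check on the threading concern, here is a toy sketch (not Qt's thread pool, and `shrink_row`/`shrink_image` are hypothetical names): when each task computes an independent row and results are collected in input order, the output bytes are the same for any number of workers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shrink_row(row: list[int]) -> list[int]:
    # Toy 2:1 downscale: average neighbouring pairs with integer arithmetic.
    return [(row[i] + row[i + 1]) >> 1 for i in range(0, len(row) - 1, 2)]


def shrink_image(rows: list[list[int]], workers: int) -> bytes:
    # Executor.map yields results in input order, however tasks are scheduled.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        out = list(pool.map(shrink_row, rows))
    return bytes(v for row in out for v in row)


image = [[(x * y) % 256 for x in range(16)] for y in range(16)]
digests = {hashlib.sha256(shrink_image(image, n)).hexdigest() for n in (1, 2, 8)}
print(len(digests))  # 1: identical bytes for every worker count
```

Threading would only become a source of differences if tasks shared an accumulator or if the way the work is split changed the arithmetic itself.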
All of this is written on a phone, without digging very deeply into the source or doing any practical experiments.
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
On Sun, Nov 03, 2024 at 11:02:16AM -0500, Ben Beasley wrote:
> Kevin’s observation about floating-point rounding and runtime dispatch is an excellent one in general.
> Those two CPUs should, as far as I can tell, be dispatched to the same SIMD implementations in this case.
> Skimming https://github.com/qt/qtbase/blob/v6.8.0/src/gui/painting/qimagescale_sse4.c..., it looks like a fixed-point implementation that entirely avoids floating-point operations. If there are no bugs, and if I’m not missing something, it should be possible to get identical results regardless of ISA extensions since no rounding is involved.
> The fact that the scaling algorithm appears to be integer-based also makes the following sources of irreproducibility less likely, but maybe not impossible:
> - Some algorithms compute “left-over” leading and/or trailing data with a scalar algorithm, and in some cases this could make the results depend on alignment of buffers in memory. Besides the fact that this is an integer implementation, at a glance, Qt doesn’t appear to be doing this. It looks like QImage must be aligned and (over-)allocated to allow everything to be done in SIMD, processing some extra pixels outside the image as necessary to make complete vectors.
> - SIMD algorithms might operate on input values and combine pixels in a different order than scalar ones, which could result in different rounding for floating-point operations. That shouldn’t matter for an integer algorithm like this, except maybe in cases of wrapping/overflow – which might perhaps be in play here.
> Another relevant fact is that the implementation is multi-threaded using a thread pool. If there is anything that depends on the order in which pixels/blocks are computed and combined, this could also result in different outputs, even in different runs on the same machine, and especially on machines with different numbers of cores.
Thanks, those are all good considerations. I ran the conversion under valgrind just to make sure it's not some trivial memory bug, but valgrind doesn't report anything. This code is executed from Python, so I think it's unlikely that there's some alignment problem or memory-use bug. If there were, we'd be seeing it much more often. (And if we were reading past the end of a buffer, I'd expect some actual corruption, i.e. random-looking pixel values, not a subtle difference.) So essentially the problem is that we have an integer algorithm where we don't expect any rounding effects, but we get an effect that looks like rounding error. ¯_(ツ)_/¯
Zbyszek