Vectorized minmax_element: avoid runtime sign dispatch #6176

@AlexGuteniev

We pass _Sign in minmax_element as a runtime parameter, and then have to dispatch on it for ARM64:

#if defined(_M_ARM64) || defined(_M_ARM64EC)
    if constexpr (!std::is_same_v<typename _Traits::_Neon, _Traits_8_neon>) {
        if (_Byte_length(_First, _Last) >= 16) {
            if (_Sign) {
                return _Minmax_element_impl<_Mode, typename _Traits::_Neon, true>(_First, _Last);
            } else {
                return _Minmax_element_impl<_Mode, typename _Traits::_Neon, false>(_First, _Last);
            }
        }
    }
    if (_Sign) {
        return _Minmax_element_impl<_Mode, typename _Traits::_Scalar, true>(_First, _Last);
    } else {
        return _Minmax_element_impl<_Mode, typename _Traits::_Scalar, false>(_First, _Last);
    }
#else // ^^^ defined(_M_ARM64) || defined(_M_ARM64EC) / !defined(_M_ARM64) && !defined(_M_ARM64EC) vvv

We should have the entry functions based on sign too, like here:

STL/stl/inc/algorithm

Lines 75 to 84 in 2626cf1

__declspec(noalias) _Min_max_1i __stdcall __std_minmax_1i(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_1u __stdcall __std_minmax_1u(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_2i __stdcall __std_minmax_2i(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_2u __stdcall __std_minmax_2u(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_4i __stdcall __std_minmax_4i(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_4u __stdcall __std_minmax_4u(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_8i __stdcall __std_minmax_8i(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_8u __stdcall __std_minmax_8u(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_f __stdcall __std_minmax_f(const void* _First, const void* _Last) noexcept;
__declspec(noalias) _Min_max_d __stdcall __std_minmax_d(const void* _First, const void* _Last) noexcept;
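To make the proposal concrete, here is a reduced, portable sketch of sign-specific entry points for the 1-byte min_element case. All names (`sketch_min_element_1_impl`, `sketch_min_element_1i`, `sketch_min_element_1u`) are hypothetical, the body is a plain scalar loop rather than the real vectorized implementation, and calling conventions and `__declspec(noalias)` are omitted. The point is only that the sign becomes a template argument fixed by the entry point, so no runtime branch on it remains.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the shared implementation: the sign is a
// template parameter, so the comparison type is chosen at compile time.
template <bool Sign>
const void* sketch_min_element_1_impl(const void* first, const void* last) noexcept {
    using T = std::conditional_t<Sign, std::int8_t, std::uint8_t>;
    auto ptr       = static_cast<const T*>(first);
    const auto end = static_cast<const T*>(last);
    assert(ptr <= end); // precondition: valid range

    const T* best = ptr; // equals `last` when the range is empty
    if (ptr != end) {
        for (++ptr; ptr != end; ++ptr) {
            if (*ptr < *best) { // strict '<' keeps the first minimum
                best = ptr;
            }
        }
    }
    return best;
}

// Hypothetical sign-specific entry points: each one is a thin wrapper that
// fixes the template argument, mirroring the __std_minmax_1i / _1u split.
const void* sketch_min_element_1i(const void* first, const void* last) noexcept {
    return sketch_min_element_1_impl<true>(first, last);
}

const void* sketch_min_element_1u(const void* first, const void* last) noexcept {
    return sketch_min_element_1_impl<false>(first, last);
}
```

With the bytes `{0x01, 0xFF, 0x7F}`, the signed entry point selects `0xFF` (which is -1 as a signed byte) and the unsigned one selects `0x01`.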

The status quo is the result of (micro-)optimization for SSE4.2 and AVX2. They don't have unsigned comparisons, so the signed intrinsics have to be used on the unsigned paths too, and with the runtime parameter we unify the signed and the unsigned paths at a very minor runtime cost.
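For readers unfamiliar with the trick, one common way to reuse signed comparisons for unsigned data is to flip the top bit of every element before comparing. The scalar sketch below illustrates that idea with a runtime `is_signed` flag; the function name is hypothetical, and this is an assumption about the general technique, not a copy of the actual vectorized code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the maximum byte (as raw bits) while using
// only signed comparisons. For unsigned input, XOR with 0x80 maps
// 0x00..0xFF onto -128..127 in an order-preserving way.
std::uint8_t sketch_max_byte_via_signed_compare(
    const std::uint8_t* first, std::size_t n, bool is_signed) noexcept {
    assert(n != 0); // precondition: non-empty input

    // The runtime parameter only selects the bias; the loop is shared.
    const std::uint8_t bias = is_signed ? 0x00 : 0x80;

    auto best = static_cast<std::int8_t>(first[0] ^ bias);
    for (std::size_t i = 1; i < n; ++i) {
        const auto cur = static_cast<std::int8_t>(first[i] ^ bias);
        if (cur > best) { // always a signed comparison
            best = cur;
        }
    }
    // Undo the bias to recover the original bit pattern.
    return static_cast<std::uint8_t>(static_cast<std::uint8_t>(best) ^ bias);
}
```

For the bytes `{0x01, 0xFF, 0x7F}`, this yields `0xFF` when the data is treated as unsigned and `0x7F` when it is treated as signed.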

But now that we have ARM64, and since we can also have unsigned comparisons for AVX-512, we need separate code paths.

This also incorporates the removal of the unnecessary parameter for floats:

// TRANSITION, ABI: remove unused `bool`
const void* __stdcall __std_min_element_f(const void* const _First, const void* const _Last, bool) noexcept {
    return _Sorting::_Minmax_element_disp<_Sorting::_Mode_min, _Sorting::_Traits_f>(_First, _Last, false);
}

I'm not sure if we should fix these now (and leave old functions for compatibility), or wait for vNext and do it cleanly at once.
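A minimal sketch of the "fix now, keep old functions" option, assuming the new entry point gets a different name: the old signature stays as a thin forwarder that ignores the `bool`, so already-compiled callers keep linking. Both names (`sketch_min_element_f`, `sketch_min_element_f_v2`) are hypothetical, `__stdcall` is omitted, and the scalar loop only stands in for the real implementation.

```cpp
#include <cassert>

// Hypothetical new entry point without the unused parameter.
const void* sketch_min_element_f_v2(const void* first, const void* last) noexcept {
    auto ptr       = static_cast<const float*>(first);
    const auto end = static_cast<const float*>(last);
    assert(ptr <= end); // precondition: valid range

    const float* best = ptr; // equals `last` when the range is empty
    if (ptr != end) {
        for (++ptr; ptr != end; ++ptr) {
            if (*ptr < *best) {
                best = ptr;
            }
        }
    }
    return best;
}

// Old signature preserved for compatibility; the `bool` is ignored and the
// call forwards to the new entry point.
const void* sketch_min_element_f(const void* first, const void* last, bool) noexcept {
    return sketch_min_element_f_v2(first, last);
}
```

The cost of this option is that both symbols have to be exported until the old one can be dropped; waiting for vNext avoids the duplicate export.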


Labels: ARM64, ARM64EC, fixed, performance
