Skip to content

Commpath#182

Open
rountree wants to merge 53 commits into
llnl:develfrom
rountree:commpath
Open

Commpath#182
rountree wants to merge 53 commits into
llnl:develfrom
rountree:commpath

Conversation

@rountree
Copy link
Copy Markdown
Collaborator

Passed all tests on my clone. Let's see what it does over here.

Previously,

chosen_realized_cachepath was copied into
set_intercept_readlink_cachepath()

chosen_realized_cachepath and chosen_parsed_cachepath were copied
into set_should_intercept_cachepath()

This PR removes both setter functions and makes the original
pointers global.
Removes chosen_cachepath and cachepath_bitindex from
  spindle_launch.h

Updates initialization of matching variables in ldcs_process_data.

determineValidCachePaths() moved from spindle_be.cc to
  ldcs_audit_server_process.c to get ldcs_process_data visibility.

Added #include "parseloc.h" to ldcs_audit_server_process.c to get
  declaration of determineValidCachePaths().

Relocated "parseloc.h" to src/util so ldcs_audit_server_process.c
  could find it.

Trued up signedness of types caused my making "parseloc.h" more
  visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with
two strings.  The symbolic version of the cachepath is no longer
communicated as it was not being used.
New name is ldcs_audit_server_md_allreduce_AND().

If we get to the point where we're using other allreduce operations
we can solve the problem of duplicating the op list in md-land and
cobo-land.  For now, we're only using one op in md-land, so the
op can go into the function name.
Unlikely it would ever make a difference, but this is much more
correct.
src/client/beboot/spindle_bootstrap.c
    Moved orig_location from static global to local
    Renamed symbolic_location to symbolic_commpath
    Renamed orig_location to orig_commpath
    Renamed location to commpath
    Renamed LDCS_LOCATION to LDCS_COMMPATH

src/client/client/client.c
    Renamed LDCS_LOCATION to LDCS_COMMPATH
    Renamed location to commpath
src/client/client_comlib/client_api.c
    Added setenv("LDCS_CHOSEN_PARSED_CACHEPATH", local_cpc);

testsuite/test_driver.c
    Replaced LDCS_LOCATION and LDCS_ORIG_LOCATION checks for cachepath with
        LDCS_CHOSEN_PARSED_CACHEPATH
    Replaced spindle_loc with cachepath
All tests pass with both distinct and identical
commpaths/cachepaths.
Additionally populates /etc/environment just in case ssh is used
to set up the servers.
The option is marked as obsolete in configure --help and will
cause an error in configure if it is specified.

As updates the CI configure scripts to use --with-cachepaths and
--with-commpath instead of --with-localstorage.
Replaces args->location with args->commpath.
Additional integration for commpath + spank-plugin.
TMPDIR left out of a docker script, and --cachepath instead of
--cachepaths.
This commit fixes a race condition where eager client processes can
submit a LDCS_MSG_CHOSEN_CACHEPATH_REQUEST before the servers have
come to a consensus.  When that is the case, the server now responds
with LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET and the client sleeps for
one second (max 10 retries) before sending the message again.

1)  include/ldcs_api.h
        Added LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET
2)  server/comlib/ldcs_api_util.c
        Added STR_CASE entry for that message.
3)  server/auditserver/ldcs_audit_server_handlers.c
        Added global "static bool cachepath_consensus_reached" above handle_cachepath_consensus()
        Set that variable to true inside handle_cachepath_consensus()
        In handle_chosen_cachepath_request(), made msg.header.type conditional on cachepath_consensus_reached
4)  client/client_comlib/client_api.c
        This is the only place LDCS_MSG_CHOSEN_CACHEPATH_REQUEST is used.
        Will sleep for 1 second after each LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET message.
The theory being that eager clients are using an uninitialized
cachepath variable.  By delaying the consensus, the failure should
happen more often.
"sending message of type: request_location_path"
is now
"sending message of type: CHOSEN_CACHEPATH_REQUEST"
Known to affect the symbolic form of candidate cachepaths.  Not
sure that's ever being used, but it's fixed now.
_message_type_to_str() can now be used in cobo_fe_comm.c.

ldcs_audit_server_fe_broadcast() now reports message type.

Only two messages are expected to be routed through there, but
it's the correct way to report it.
Cleanup now takes both commpath and cachepath and prefixes for
removing files created by Spindle.
The original LDCS_LOCATION_MOD checked to see if there were
multiple servers running on a node and, if so, modified the
location string so that each server had its own location.

The code did not handle the case where the directory above
the requested directory was not writeable, e.g., if the user
passed in --location=/tmp, the code would try to create a
directory /tmp-00 for the first server.  That fails.

With commpath and cachepath replacing location, and with new
initialization paths, the existing code would modify only
commpath after the commpath directory had been created.

If the multiple-server case needs to be supported, commpath- and
cachepath-specific code needs to be added back in.
That configure parameter is no longer supported.

Replaced with
        --with-cachepaths=/tmp/commpath/cachepath
        --with-commpath=/tmp/commpath
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants