Commpath#182
Open
rountree wants to merge 53 commits into
Previously, chosen_realized_cachepath was copied into set_intercept_readlink_cachepath(), and chosen_realized_cachepath and chosen_parsed_cachepath were copied into set_should_intercept_cachepath(). This PR removes both setter functions and makes the original pointers global.
Removes chosen_cachepath and cachepath_bitindex from spindle_launch.h.
Updates initialization of matching variables in ldcs_process_data.
determineValidCachePaths() moved from spindle_be.cc to ldcs_audit_server_process.c to get ldcs_process_data visibility.
Added #include "parseloc.h" to ldcs_audit_server_process.c to get the declaration of determineValidCachePaths().
Relocated "parseloc.h" to src/util so ldcs_audit_server_process.c could find it.
Trued up signedness of types, made necessary by making "parseloc.h" more visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with two strings. The symbolic version of the cachepath is no longer communicated as it was not being used.
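As a rough illustration of a single reply carrying two strings, the sketch below packs two NUL-terminated strings back to back into one payload and unpacks them again. The helper names and the buffer layout are hypothetical and are not taken from the Spindle sources; the PR text does not say which two strings are sent or how they are laid out in the message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pack two NUL-terminated strings back to back.
 * Returns the number of payload bytes used, or 0 if buf is too small. */
static size_t pack_two_strings(char *buf, size_t buflen,
                               const char *first, const char *second)
{
    size_t n1 = strlen(first) + 1;   /* include the NUL terminator */
    size_t n2 = strlen(second) + 1;
    if (n1 + n2 > buflen)
        return 0;
    memcpy(buf, first, n1);
    memcpy(buf + n1, second, n2);
    return n1 + n2;
}

/* Hypothetical helper: recover both strings from a payload of len bytes.
 * Checks that both terminators lie inside the payload.
 * Returns 0 on success, -1 on a malformed payload. */
static int unpack_two_strings(const char *buf, size_t len,
                              const char **first, const char **second)
{
    const char *end1 = memchr(buf, '\0', len);
    if (end1 == NULL)
        return -1;
    size_t n1 = (size_t)(end1 - buf) + 1;
    if (n1 >= len)
        return -1;                   /* no room left for a second string */
    if (memchr(buf + n1, '\0', len - n1) == NULL)
        return -1;
    *first = buf;
    *second = buf + n1;
    return 0;
}
```

The receiver validates terminators rather than trusting the sender, so a truncated payload is rejected instead of being read past its end.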
New name is ldcs_audit_server_md_allreduce_AND(). If we get to the point where we're using other allreduce operations we can solve the problem of duplicating the op list in md-land and cobo-land. For now, we're only using one op in md-land, so the op can go into the function name.
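For readers unfamiliar with the operation, here is a minimal sketch of what an AND allreduce computes, assuming each server contributes a uint64_t bitmask in which bit i means "candidate cachepath i is usable here". That interpretation, and both helper names, are assumptions made for illustration; this is a local loop, not the md-land implementation of ldcs_audit_server_md_allreduce_AND().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: AND together one validity bitmask per server.
 * With zero servers the result is all ones, the identity for AND. */
static uint64_t allreduce_and(const uint64_t *masks, size_t nservers)
{
    uint64_t result = UINT64_MAX;
    for (size_t i = 0; i < nservers; i++)
        result &= masks[i];
    return result;
}

/* Hypothetical helper: index of the lowest set bit, or -1 if no
 * candidate is usable on every server. */
static int first_common_cachepath(uint64_t mask)
{
    for (int i = 0; i < 64; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

Because AND is associative and commutative, the result does not depend on the order in which servers are combined, which is what makes it suitable for a tree-based reduction.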
Unlikely it would ever make a difference, but this is much more correct.
src/client/beboot/spindle_bootstrap.c
Moved orig_location from static global to local
Renamed symbolic_location to symbolic_commpath
Renamed orig_location to orig_commpath
Renamed location to commpath
Renamed LDCS_LOCATION to LDCS_COMMPATH
src/client/client/client.c
Renamed LDCS_LOCATION to LDCS_COMMPATH
Renamed location to commpath
src/client/client_comlib/client_api.c
Added setenv("LDCS_CHOSEN_PARSED_CACHEPATH", local_cpc);
testsuite/test_driver.c
Replaced LDCS_LOCATION and LDCS_ORIG_LOCATION checks for cachepath with
LDCS_CHOSEN_PARSED_CACHEPATH
Replaced spindle_loc with cachepath
All tests pass with both distinct and identical commpaths/cachepaths.
Additionally populates /etc/environment just in case ssh is used to set up the servers.
The option is marked as obsolete in configure --help and will cause an error in configure if it is specified. Also updates the CI configure scripts to use --with-cachepaths and --with-commpath instead of --with-localstorage.
Replaces args->location with args->commpath.
Additional integration for commpath + spank-plugin.
TMPDIR left out of a docker script, and --cachepath instead of --cachepaths.
This commit fixes a race condition where eager client processes can
submit an LDCS_MSG_CHOSEN_CACHEPATH_REQUEST before the servers have
come to a consensus. When that is the case, the server now responds
with LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET and the client sleeps for
one second (max 10 retries) before sending the message again.
1) include/ldcs_api.h
Added LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET
2) server/comlib/ldcs_api_util.c
Added STR_CASE entry for that message.
3) server/auditserver/ldcs_audit_server_handlers.c
Added global "static bool cachepath_consensus_reached" above handle_cachepath_consensus()
Set that variable to true inside handle_cachepath_consensus()
In handle_chosen_cachepath_request(), made msg.header.type conditional on cachepath_consensus_reached
4) client/client_comlib/client_api.c
This is the only place LDCS_MSG_CHOSEN_CACHEPATH_REQUEST is used.
Will sleep for 1 second after each LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET message.
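The retry behavior described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: the enum values, the callback type, the fake server, and the reading of "max 10 retries" as one initial attempt plus up to ten retries are assumptions, not the code in client_api.c. The delay is a parameter so the loop can be exercised without waiting; the PR uses one second.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-ins for the two reply types the client can see. */
enum reply_type { REPLY_CHOSEN_CACHEPATH, REPLY_NO_CONSENSUS_YET };

/* Hypothetical transport callback: sends one request, returns the reply type. */
typedef enum reply_type (*request_fn)(void *ctx);

#define MAX_RETRIES 10

/* Returns true once a cachepath reply arrives, false if the server still
 * reports no consensus after the initial attempt plus MAX_RETRIES retries. */
static bool request_cachepath(request_fn send_request, void *ctx,
                              unsigned delay_seconds)
{
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
        if (send_request(ctx) == REPLY_CHOSEN_CACHEPATH)
            return true;
        if (attempt < MAX_RETRIES)
            sleep(delay_seconds);    /* back off before asking again */
    }
    return false;
}

/* Fake server for demonstration: consensus is reached only after
 * requests_until_consensus requests have already been refused. */
struct fake_server {
    int requests_seen;
    int requests_until_consensus;
};

static enum reply_type fake_send(void *ctx)
{
    struct fake_server *s = ctx;
    s->requests_seen++;
    return s->requests_seen > s->requests_until_consensus
               ? REPLY_CHOSEN_CACHEPATH
               : REPLY_NO_CONSENSUS_YET;
}
```

Bounding the retries keeps a client from hanging forever if the servers never agree; the caller still has to decide what to do when false comes back.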
The theory is that eager clients are using an uninitialized cachepath variable. If so, delaying the consensus should make the failure happen more often.
"sending message of type: request_location_path" is now "sending message of type: CHOSEN_CACHEPATH_REQUEST"
Known to affect the symbolic form of candidate cachepaths. Not sure that's ever being used, but it's fixed now.
_message_type_to_str() can now be used in cobo_fe_comm.c. ldcs_audit_server_fe_broadcast() now reports message type. Only two messages are expected to be routed through there, but it's the correct way to report it.
Cleanup now takes both commpath and cachepath as prefixes for removing files created by Spindle.
The original LDCS_LOCATION_MOD checked to see if there were multiple servers running on a node and, if so, modified the location string so that each server had its own location. The code did not handle the case where the directory above the requested directory was not writeable, e.g., if the user passed in --location=/tmp, the code would try to create a directory /tmp-00 for the first server. That fails. With commpath and cachepath replacing location, and with new initialization paths, the existing code would modify only commpath after the commpath directory had been created. If the multiple-server case needs to be supported, commpath- and cachepath-specific code needs to be added back in.
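To make the /tmp-00 failure mode concrete, the sketch below shows the kind of suffixing described above. The function name and the "%s-%02d" format are assumptions chosen only to reproduce the /tmp to /tmp-00 example from the text; this is not the removed LDCS_LOCATION_MOD code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: derive a per-server directory name by appending
 * "-NN" to the requested location. Returns 0 on success, -1 if the
 * result does not fit in out. */
static int per_server_location(char *out, size_t outlen,
                               const char *location, int server_index)
{
    int n = snprintf(out, outlen, "%s-%02d", location, server_index);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

The derived name is a sibling of the requested directory rather than a child of it, so creating it requires write permission on the parent. For /tmp the parent is /, which is why the mkdir fails for ordinary users.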
That configure parameter is no longer supported.
Replaced with
--with-cachepaths=/tmp/commpath/cachepath
--with-commpath=/tmp/commpath
Passed all tests on my clone. Let's see what it does over here.