diff --git a/docs/community/ai-tooling-policy.md b/docs/community/ai-tooling-policy.md new file mode 100644 index 000000000..a50325998 --- /dev/null +++ b/docs/community/ai-tooling-policy.md @@ -0,0 +1,18 @@ +# Guidelines for AI-assisted Contributions + +The Apache Wayang community welcomes the use of AI and generative tooling as part of the contribution process, provided contributors follow the guidelines below. These guidelines align with the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html). + +1. **Verify licensing compliance.** AI-generated code may inadvertently reproduce copyrighted material. Before submitting, ensure the output does not include content that conflicts with the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) or the [ASF 3rd Party Licensing Policy](https://www.apache.org/legal/resolved.html). + +2. **Understand what you submit.** If you cannot explain why the code works, do not submit it. You are accountable for bugs, security issues, and license violations in your contribution. + +3. **Disclose AI tool usage in commits.** When any part of a contribution was generated or significantly assisted by an AI tool, include a `Generated-by:` token in the commit message. For example: + ``` + Fix null pointer in JdbcExecutor + + Generated-by: GitHub Copilot + ``` + +4. **Keep PR discussions human.** When participating in PR discussions, e.g., code review comments, questions, clarifications, and responses, the content must be written by a human, not generated by an AI tool. If an AI tool (such as GitHub Copilot) posts a comment, it must be clearly attributed as such and not presented as the contributor's own words. This ensures that code review remains a genuine exchange between people, preserving the quality, accountability, and community trust that Apache Wayang depends on. + +*These guidelines will be updated as AI tooling and the legal landscape around it continue to evolve. Questions or suggestions can be raised on the [dev mailing list](https://wayang.apache.org/docs/community/mailinglist).* diff --git a/docs/guide/getting-started.md b/docs/guide/examples.md similarity index 93% rename from docs/guide/getting-started.md rename to docs/guide/examples.md index b8fe3c09c..952e23896 100644 --- a/docs/guide/getting-started.md +++ b/docs/guide/examples.md @@ -1,7 +1,7 @@ --- -title: Getting started -sidebar_position: 2 -id: getting-started +title: Installation and Examples +sidebar_position: 10 +id: examples --- ## Requirements -Apache Wayang is built upon the foundations of Java 11 and Scala 2.12, providing a robust and versatile platform for data processing applications. If you intend to build Wayang from source, you will also need to have Apache Maven, the popular build automation tool, installed on your system. Additionally, be mindful that some of the processing platforms supported by Wayang may have their own specific installation requirements. +Apache Wayang is built upon the foundations of Java 11 and Scala 2.12, providing a robust and versatile platform for data processing applications. If you intend to build Wayang from source, you wi[...] ### Get Wayang -Apache Wayang is readily available through Maven Central, facilitating seamless integration into your development workflow. For instance, to utilize Wayang in your Maven-based project, simply add the following dependency to your project's POM file: +Apache Wayang is readily available through Maven Central, facilitating seamless integration into your development workflow. For instance, to utilize Wayang in your Maven-based project, simply add [...] ```xml org.apache.wayang @@ -69,7 +69,7 @@ If you need to rebuild Wayang, e.g., to use a different Scala version, you can s ``` ### Configure Wayang -To enable Apache Wayang's smooth operation, you need to equip it with details about your processing platforms' capabilities and how to interact with them. A default configuration is available for initial testing, but creating a properties file is generally preferable for fine-tuning the configuration to suit your specific requirements. To harness this personalized configuration effortlessly, launch your application via +To enable Apache Wayang's smooth operation, you need to equip it with details about your processing platforms' capabilities and how to interact with them. A default configuration is available for [...] ```shell $ java -Dwayang.configuration=url://to/my/wayang.properties ... ``` @@ -79,7 +79,7 @@ Essential configuration settings: * `wayang.core.log.enabled (= true)`: whether to log execution statistics to allow learning better cardinality and cost estimators for the optimizer * `wayang.core.log.executions (= ~/.wayang/executions.json)` where to log execution times of operator groups * `wayang.core.log.cardinalities (= ~/.wayang/cardinalities.json)` where to log cardinality measurements - * `wayang.core.optimizer.instrumentation (= org.apache.wayang.core.profiling.OutboundInstrumentationStrategy)`: where to measure cardinalities in Wayang plans; other options are `org.apache.wayang.core.profiling.NoInstrumentationStrategy` and `org.apache.wayang.core.profiling.FullInstrumentationStrategy` + * `wayang.core.optimizer.instrumentation (= org.apache.wayang.core.profiling.OutboundInstrumentationStrategy)`: where to measure cardinalities in Wayang plans; other options are `org.apache.wa[...] * `wayang.core.optimizer.reoptimize (= false)`: whether to progressively optimize Wayang plans * `wayang.basic.tempdir (= file:///tmp)`: where to store temporary files, in particular for inter-platform communication * Java Streams @@ -109,10 +109,10 @@ Essential configuration settings: * `wayang.postgres.cpu.mhz (= 2700)`: clock frequency of processor PostgreSQL runs on in MHz * `wayang.postgres.cpu.cores (= 2)`: number of cores PostgreSQL runs on -To effectively define your applications with Apache Wayang, utilize its Scala or Java API, conveniently found within the `wayang-api` module. For clear illustrations, refer to the provided examples below. +To effectively define your applications with Apache Wayang, utilize its Scala or Java API, conveniently found within the `wayang-api` module. For clear illustrations, refer to the provided exampl[...] ## Cost Functions -Wayang provides a utility to learn cost functions from historical execution data. Specifically, Wayang can learn configurations for load profile estimators (that estimate CPU load, disk load etc.) for both operators and UDFs, as long as the configuration provides a template for those estimators. +Wayang provides a utility to learn cost functions from historical execution data. Specifically, Wayang can learn configurations for load profile estimators (that estimate CPU load, disk load etc.[...] As an example, the `JavaMapOperator` draws its load profile estimator configuration via the configuration key `wayang.java.map.load`. Now, it is possible to specify a load profile estimator template in the configuration under the key `.template`, e.g.: @@ -122,7 +122,7 @@ wayang.java.map.load.template = {\ "cpu":"?*in0"\ } ``` -This template encapsulates a load profile estimator that requires at minimum one input cardinality and one output cardinality. Furthermore, it simulates CPU load by assuming a direct relationship with the input cardinality. However, more complex functions are possible. +This template encapsulates a load profile estimator that requires at minimum one input cardinality and one output cardinality. Furthermore, it simulates CPU load by assuming a direct relationship[...] In particular, you can use * the variables `in0`, `in1`, ... and `out0`, `out1`, ... to incorporate the input and output cardinalities, respectively; @@ -131,12 +131,12 @@ In particular, you can use * the functions `min(x0, x1, ...))`, `max(x0, x1, ...)`, `abs(x)`, `log(x, base)`, `ln(x)`, `ld(x)`; * and the constants `e` and `pi`. -While Apache Wayang provides templates for all execution operators, you will need to explicitly define your user-defined functions (UDFs) by specifying their cost functions, which are based on configuration parameters. This involves creating an initial specification and template for each UDF. +While Apache Wayang provides templates for all execution operators, you will need to explicitly define your user-defined functions (UDFs) by specifying their cost functions, which are based on co[...] As soon as execution data has been collected, you can initiate: ```shell java ... org.apache.wayang.profiler.ga.GeneticOptimizerApp [configuration URL [execution log]] ``` -This tool will attempt to determine suitable values for the question marks (`?`) within the load profile estimator templates, aligning them with the collected execution data and pre-defined configuration entries for the load profile estimators. These optimized values can then be directly incorporated into your configuration. +This tool will attempt to determine suitable values for the question marks (`?`) within the load profile estimator templates, aligning them with the collected execution data and pre-defined confi[...] ## Examples diff --git a/docs/guide/getting-started.mdx b/docs/guide/getting-started.mdx new file mode 100644 index 000000000..24cf47920 --- /dev/null +++ b/docs/guide/getting-started.mdx @@ -0,0 +1,332 @@ +--- +title: Getting started +sidebar_position: 1 +description: Write a data pipeline once and let Apache Wayang run it on the best available engine — your laptop, Spark, Flink, or a database. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Getting Started + +**Write your data pipeline once. Run it anywhere.** + +You write your pipeline against a single API, then decide how it runs. Point it at one +engine and it runs there — or hand Wayang's cost-based optimizer the choice and let it pick +the best platform for each step across your laptop, Apache Spark, Apache Flink, or a +database, even splitting a single job across several. Either way, when your data outgrows +one machine you don't rewrite anything — you just make another engine available. + +This page gets you from zero to a running cross-platform pipeline in a few minutes. + +--- + +## How it works + +Most data tools lock you into one engine. Pick Spark, and your code is Spark code forever. +Outgrow it, or need a database in the mix, and you rewrite. + +Wayang sits one level up. You write a pipeline against Wayang's API and register the engines +you *have* — then it's your call. Want control? Register one engine and it runs there. Want +it handled? Register several and let Wayang's cost-based optimizer pick the best one for each +step: + +![A single pipeline, written once, feeds the Wayang optimizer, which routes each step to the best available engine — Local, Spark, Flink, Postgres, and others.](/img/wayang-architecture.svg) + +Same code on your laptop and on a 100-node cluster. + +--- + +## Quickstart + +We'll run a word count locally first — no cluster, nothing to install on a server — then +make Spark available with a one-line change. The pipeline itself never changes; only the +set of engines you register does. + +### 1. Run locally + + + + +```java +import org.apache.wayang.core.api.Configuration; +import org.apache.wayang.core.api.WayangContext; +import org.apache.wayang.api.JavaPlanBuilder; +import org.apache.wayang.basic.data.Tuple2; +import org.apache.wayang.java.Java; +import java.util.Arrays; + +public class WordCount { + public static void main(String[] args) { + // Register ONLY the local Java engine → runs on your machine, no cluster needed. + WayangContext wayang = new WayangContext(new Configuration()) + .withPlugin(Java.basicPlugin()); + + new JavaPlanBuilder(wayang) + .withJobName("WordCount") + .withUdfJarOf(WordCount.class) + .readTextFile("file:///path/to/input.txt") + .flatMap(line -> Arrays.asList(line.split("\\W+"))) + .filter(word -> !word.isEmpty()) + .map(word -> new Tuple2<>(word.toLowerCase(), 1)) + .reduceByKey(Tuple2::getField0, + (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1())) + .writeTextFile("file:///path/to/output.txt", t -> t.getField0() + ": " + t.getField1()); + } +} +``` + + + + +```scala +import org.apache.wayang.api._ +import org.apache.wayang.core.api.{Configuration, WayangContext} +import org.apache.wayang.java.Java + +object WordCount { + def main(args: Array[String]): Unit = { + // Register ONLY the local Java engine → runs on your machine, no cluster needed. + val wayangCtx = new WayangContext(new Configuration) + wayangCtx.register(Java.basicPlugin) + + new PlanBuilder(wayangCtx) + .withJobName("WordCount") + .withUdfJarsOf(this.getClass) + .readTextFile("file:///path/to/input.txt") + .flatMap(_.split("\\W+")) + .filter(_.nonEmpty) + .map(word => (word.toLowerCase, 1)) + .reduceByKey(_._1, (c1, c2) => (c1._1, c1._2 + c2._2)) + .writeTextFile("file:///path/to/output.txt", t => s"${t._1}: ${t._2}") + } +} +``` + + + + +```python +from pywy.dataquanta import WayangContext +from pywy.platforms.java import JavaPlugin + +# Register ONLY the local Java engine → runs on your machine, no cluster needed. +ctx = WayangContext().register({JavaPlugin}) + +(ctx + .textfile("file:///path/to/input.txt") + .flatmap(lambda line: line.split()) + .filter(lambda word: word.strip() != "") + .map(lambda word: (word.lower(), 1)) + .reduce_by_key(lambda t: t[0], lambda t1, t2: (t1[0], int(t1[1]) + int(t2[1]))) + .store_textfile("file:///path/to/output.txt")) +``` + + + + +It executes locally. Good for development, tests, and small data. + +### 2. Run it on Spark + +Now run the *exact same pipeline* on Spark instead of locally. You don't touch the pipeline — +you change which platform you register: comment out Java and register Spark. + + + + +```java +import org.apache.wayang.spark.Spark; // ← swap the import + +// Same pipeline as before — only the registered platform changed. +WayangContext wayang = new WayangContext(new Configuration()) + // .withPlugin(Java.basicPlugin()) // ← comment out the local engine + .withPlugin(Spark.basicPlugin()); // ← register Spark instead +``` + + + + +```scala +import org.apache.wayang.spark.Spark // ← swap the import + +// Same pipeline as before — only the registered platform changed. +val wayangCtx = new WayangContext(new Configuration) +// wayangCtx.register(Java.basicPlugin) // ← comment out the local engine +wayangCtx.register(Spark.basicPlugin) // ← register Spark instead +``` + + + + +```python +from pywy.platforms.spark import SparkPlugin // ← use Spark instead of Java + +# Same pipeline as before — only the registered platform changed. +ctx = WayangContext().register({SparkPlugin}) +``` + + + + +Run it again. The same pipeline now executes on Spark — you changed *where* it runs without +changing *what* it does. Switch to Flink or any other supported platform the same way: swap +the import and the registered plugin. + + +### 3. Register both and let the optimizer choose + +This is the point of Wayang. In practice you don't have to pick a platform at all: you register every +engine you have and let the optimizer choose the best one for each step — even +splitting a single job across engines. The pipeline is still the same; you just stop deciding +where it runs. + + + + +```java +// Register BOTH platforms — Wayang's optimizer decides which to use per step. +WayangContext wayang = new WayangContext(new Configuration()) + .withPlugin(Java.basicPlugin()) + .withPlugin(Spark.basicPlugin()); +``` + + + + +```scala +// Register BOTH platforms — Wayang's optimizer decides which to use per step. +val wayangCtx = new WayangContext(new Configuration) +wayangCtx.register(Java.basicPlugin) +wayangCtx.register(Spark.basicPlugin) +``` + + + + +```python +# Register BOTH platforms — Wayang's optimizer decides which to use per step. +ctx = WayangContext().register({JavaPlugin, SparkPlugin}) +``` + + + + +Now Wayang owns the placement decision. For each operator it estimates the cost on every +registered platform and picks the cheapest — keeping a small job entirely local, pushing a +large one onto Spark, or mixing both within the same job as the data demands. You wrote the +pipeline once in step 1; steps 2 and 3 only changed which engines were on the table. + +> On a tiny input you'll see it keep everything local (that's the optimizer working +> correctly, not ignoring Spark). The cross-platform decisions show up once the data is big +> enough for them to pay off. Read [How Wayang chooses a platform](https://wayang.apache.org/docs/introduction/about) for what +> drives those choices. + +--- + +## Install + +Replace WAYANG_VERSION below with [latest maven release](https://mvnrepository.com/artifact/org.apache.wayang). + + + + +Maven: + +```xml + + org.apache.wayang + wayang-core + WAYANG_VERSION + + + org.apache.wayang + wayang-basic + WAYANG_VERSION + + + org.apache.wayang + wayang-api-scala-java + WAYANG_VERSION + + + + org.apache.wayang + wayang-java + WAYANG_VERSION + + + org.apache.wayang + wayang-spark + WAYANG_VERSION + +``` + +Or build from source (always works, no published artifact needed). Just make sure to use the [latest snapshot version](https://github.com/apache/wayang/blob/main/pom.xml): + +```bash +git clone https://github.com/apache/wayang.git +cd wayang +./mvnw clean install -DskipTests +``` + +To run on Spark, you'll need Apache Spark 3+ on your `PATH`. + + + + +:::note +Getting Python running today is involved — you build both the Python +package and the Wayang backend from source and start a local server before any pipeline +runs. A simpler one-command path (a prebuilt Docker image) is something the project wants to +offer, and help is welcome. +::: + +pywayang has no PyPI release yet, and it's a client that talks to a running Wayang REST +API — so setup is two parts: install the Python package, then start the Wayang backend. + +**1. Build and install the Python package** (from the repo root): + +```bash +cd python +pip install --upgrade build # may also require the python3-venv system package +python3 -m build +python3 -m pip install dist/pywy-WAYANG_VERSION.tar.gz +``` + +**2. Build the Wayang assembly and start the REST API** (from the repo root): + +```bash +./mvnw clean package -pl :wayang-assembly -Pdistribution +cd wayang-assembly/target/ +tar -xf apache-wayang-assembly-WAYANG_VERSION-dist.tar.gz +cd wayang-WAYANG_VERSION +./bin/wayang-submit org.apache.wayang.api.json.Main & +``` + +Before packaging, set the Python worker paths in +`wayang-api/wayang-api-python/src/main/resources/wayang-api-python-defaults.properties` +— the location of `python/src/pywy/execution/worker.py`, your `python3` command, and your +site-packages directory. You'll also need a Java 11+ runtime available. + +With the REST API running, the Python examples above will execute. See the +[pywayang README](https://github.com/apache/wayang/tree/main/python) for full details. + + + + +--- + + + +*Stuck on the first run? That's a documentation bug, and we want to know. +Ask on the [user mailing list](https://wayang.apache.org/docs/community/mailinglist).* \ No newline at end of file diff --git a/docs/guide/installation.md b/docs/guide/installation.md index 3bb53b459..10ee2d5ed 100644 --- a/docs/guide/installation.md +++ b/docs/guide/installation.md @@ -1,6 +1,6 @@ --- title: How to build Wayang -sidebar_position: 1 +sidebar_position: 2 id: installation --- @@ -59,8 +59,8 @@ echo "export PATH=${PATH}:${WAYANG_HOME}/bin" >> ~/.zshrc source ~/.zshrc ``` ### Others -- You need to install Apache Spark version 3 or higher. Don’t forget to set the `SPARK_HOME` environment variable. -- You need to install Apache Hadoop version 3 or higher. Don’t forget to set the `HADOOP_HOME` environment variable. +- You need to install Apache Spark version 3 or higher. Don't forget to set the `SPARK_HOME` environment variable. +- You need to install Apache Hadoop version 3 or higher. Don't forget to set the `HADOOP_HOME` environment variable. ## Run the program diff --git a/static/img/wayang-architecture.svg b/static/img/wayang-architecture.svg new file mode 100644 index 000000000..4ad985077 --- /dev/null +++ b/static/img/wayang-architecture.svg @@ -0,0 +1,78 @@ + + + + + + + + + + + + Your pipeline + written once + + + + + + + Wayang optimizer + chooses where each step runs + + + + + + + + + + + Local + + + Spark + + + Flink + + + Postgres + + …and any other supported platform +