Monitor and Update Progress of BackupWorkers #12844
akankshamahajan15 merged 5 commits into apple:main
9e393d1 to
26d44d4
26d44d4 to
09f6b13
}
}

Future<Void> backupWorkerRangePartitioned(BackupInterface interf,
Moved this function to the end and added more actors, such as monitorBackupRangePartitionedProgress,
pullAsync, and uploadData.
saintstack
left a comment
The writeup on this PR is very nice. It's not in code though, or in a doc. Should it be?
What are these <UID_abc123>? Are these the UID and range a backup worker is responsible for? The start key? Is the UID not enough?
If a backup worker fails to write status, that's OK? The 'progress' is just stale?
saintstack
left a comment
Looks good. What is the testing story? Is that in subsequent PRs?
co_return;
}

Future<Void> monitorBackupRangePartitionedProgress(BackupRangePartitionedData* self) {
Ok this runs forever... no cancel?
Yes. As in the current BackupWorker, this function runs continuously.
When the backup worker is removed, it throws the error worker_removed(),
and the catch block cancels all actors, including the monitor actor.
const LogEpoch backupEpoch; // the epoch workers should pull mutations
// TODO akanksha: Update oldestBackupEpoch wherever needed.
LogEpoch oldestBackupEpoch = 0; // oldest epoch that still has data on tLogs for backup to pull
// Minimum known committed version in StorageServers.
// update progress so far if previous epochs are done.
if (self->recruitedEpoch == self->oldestBackupEpoch) {
    Version v = std::numeric_limits<Version>::max();
    // Find the version we can guarantee is fully backed up for all backup workers.
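The hunk above seeds v with the largest possible Version, which suggests that a minimum is then taken over the versions the workers have saved. A minimal sketch of that fold follows, assuming the progress values have already been read into a plain vector. Version is aliased to int64_t here, and minBackedUpVersion and its input are hypothetical names, not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Assumption: a Version is a signed 64-bit integer in this sketch.
using Version = int64_t;

// Hypothetical helper: returns the smallest version reported by any worker,
// i.e. a version that every worker has reached.
// With no workers, the seed value (max) is returned unchanged.
Version minBackedUpVersion(const std::vector<Version>& savedVersions) {
    Version v = std::numeric_limits<Version>::max();
    for (Version saved : savedVersions) {
        v = std::min(v, saved);
    }
    return v;
}
```

Fed the three example versions 1000000, 999500 and 1001000, the helper returns 999500, the version of the slowest worker.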
done = exitEarly ? Void() : uploadData(&self);

while (true) {
    auto res = co_await race(dbInfoChange, done, error);
Yes, testing will be done once the backup worker is complete. Right now it's not integrated, so I'm testing in a hacky way with mock servers. It will be done properly in later PRs.
UID_abc123 is actually the backup worker ID.
uploadData() and goes to backupWorker.
BackupWorkerDoneRequest(self.myId, self.backupEpoch)));
break;
} else if (res.index() != 2) {
UNREACHABLE();
Does this just assert if any of the actors fails?
Yes, it prints an assertion failure message to stderr and creates a TraceEvent with severity SevError.
cf76e9f to
8a179c5
This PR implements:
- monitorBackupRangePartitionedProgress, which reads the progress of all backup workers and sets up the backup keys accordingly.
- BackupRangePartitionedProgress, which has only the minimal functions needed for now (it doesn't need the log router logic, and its key names and their decoding functions will be different).
- pullAsync, uploadData, shouldExitEarly, etc.

Flow:
Each backup worker uses one key in the key range \xff\x02/backupRangePartitionedProgress/,
namely \xff\x02/backupRangePartitionedProgress/<UID_abc123>,
where UID_abc123 is the backup worker ID.
┌─────────────────────────────────────────────────────────────────────────┐
│  FDB System Keyspace (backupRangePartitionedProgressKeys range)         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Key: \xff\x02/backupRangePartitionedProgress/<UID_abc123>              │
│  Value: {epoch: 5, pop version: 1000000, tag: Tag(-2,0), totalTags: 3}  │
│  ↑ Worker 0 progress                                                    │
│                                                                         │
│  Key: \xff\x02/backupRangePartitionedProgress/<UID_def456>              │
│  Value: {epoch: 5, pop version: 999500, tag: Tag(-2,1), totalTags: 3}   │
│  ↑ Worker 1 progress                                                    │
│                                                                         │
│  Key: \xff\x02/backupRangePartitionedProgress/<UID_ghi789>              │
│  Value: {epoch: 5, pop version: 1001000, tag: Tag(-2,2), totalTags: 3}  │
│  ↑ Worker 2 progress                                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
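The layout in the diagram can be sketched roughly as follows. This is an illustration only: the struct fields mirror the diagram, but WorkerProgress, progressKeyFor and isProgressKey are hypothetical names, and the worker ID is treated as an opaque string rather than FoundationDB's actual UID encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Prefix of the progress key range, as given in the PR description.
const std::string kProgressPrefix = "\xff\x02/backupRangePartitionedProgress/";

// Hypothetical mirror of the per-worker value shown in the diagram.
struct WorkerProgress {
    int64_t epoch;      // e.g. 5
    int64_t popVersion; // e.g. 1000000
    int32_t tagId;      // second component of Tag(-2, id)
    int32_t totalTags;  // e.g. 3
};

// Builds the key for one worker: prefix followed by the worker ID.
// Assumption: the ID is appended as-is; the real encoding may differ.
std::string progressKeyFor(const std::string& workerId) {
    return kProgressPrefix + workerId;
}

// Returns true if the key lies under the progress prefix.
bool isProgressKey(const std::string& key) {
    return key.compare(0, kProgressPrefix.size(), kProgressPrefix) == 0;
}
```

Because every worker writes under its own ID, workers never overwrite each other's progress, and a single range read over the prefix returns one entry per worker.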
Each worker periodically calls saveProgress() to write to the database. Each worker writes its progress (Value), which is the same for all backups it is handling.
Only Worker 0 calls BackupWorker.monitorBackupRangePartitionedProgress(), which in turn calls BackupRangePartitionedProgress.getBackupRangePartitionedProgress.
BackupProgress.getBackupRangePartitionedProgress:
a. It reads all keys and values with tr.getRange(backupRangePartitionedProgressKeys), i.e. the range \xff\x02/backupRangePartitionedProgress/, and gets
a flat list of key-value pairs. For example: