Hc/fix impossible schemes#43
Conversation
K-Meech
left a comment
There was a problem hiding this comment.
Thanks @HenryCrosswell - this is looking great! Your additions are filtering the impossible schemes really nicely, which will stop any incorrect data making it to the later processing stages.
I've put a few comments below - most are minor wording suggestions, but the main one is we need to make sure un-genotyped individuals can pass through the standardise process without erroring.
| pooch_data_path("standardised-data-forbidden-genotypes.csv") | ||
| ) | ||
|
|
||
| with pytest.raises(TypeError): |
There was a problem hiding this comment.
By adding with pytest.raises(TypeError), this test is allowed to pass even though it throws an error during standardise_pyrat_csv and never reaches pd.testing.assert_frame_equal. This means it is no longer testing that genotypes are standardised correctly.
You'll need to remove the with pytest.raises(TypeError) and rather add a fix in _remove_impossible_breeding_schemes instead. Otherwise, at the moment, any ungenotyped individuals that enter standardise_pyrat_csv will cause the processing to stop early (we need these ungenotyped individuals later in the historical stats part - see this issue: #8)
There was a problem hiding this comment.
I have addressed this issues, however it has removed an impossible scheme from the impossible genotype dataset. Causing issues in this tests - test_handling_ungenotyped_individuals_in_stats. I attempted to fix by removing and then fixing the standardised offspring. However it caused a tangle of issues. So I have left the test_data as is, leaving only the dataframe missmatch.
[left]: (8, 11)
[right]: (9, 11)
There was a problem hiding this comment.
Thanks @HenryCrosswell - I've updated ID_011 and the corresponding tests on this PR: #50 and on GIN. Once that is merged, you should be able to merge main into this branch and GIN's master into your GIN branch and hopefully the tests will pass.
There was a problem hiding this comment.
@HenryCrosswell #50 is merged now, so could you update your branches as mentioned above? If you open a PR on GIN I can take a quick look over that and merge it 👍
| genotype_father = standardised_df_row["genotype_father"] | ||
| genotype_mother = standardised_df_row["genotype_mother"] | ||
| genotype_offspring = standardised_df_row["genotype_offspring"] | ||
|
|
There was a problem hiding this comment.
To keep rows with un-genotyped offspring, you could add something like this here:
# If the offspring are un-genotyped, we keep the row as there is no way
# of checking if the breeding scheme was valid
if pd.isna(genotype_offspring):
return pop
Hopefully this should make the test_standardise_genotypes test pass, although I see that there is one remaining impossible breeding scheme in the test file I missed. You can change: pyrat-data-forbidden-genotypes.csv / standardised-data-forbidden-genotypes.csv ID-011 genotype_mother to wt_het on your GIN branch.
|
Tackled all your comments, although i did it within vscode, so there might be some small differences - let me know if you want me to change anything else. |
Before submitting a pull request (PR), please read the contributing guide.
Please fill out as much of this template as you can, but if you have any problems or questions, just leave a comment and we will help out :)
Description
What is this PR
Why is this PR needed?
Since the PyRAT db is manually populated, there is a possibility that someone could have incorrectly entered genotypes for the offspring.
What does this PR do?
This adds a function to standardise.py that checks whether the offsprings genotype is a possible combination of the two parents, using the Mendelian ratios.
References
closes issue #10
How has this PR been tested?
Added new test to test_standardise, along with new test data which includes a raw file with impossible schemes, and a standardised file that asserts that those schemes have been removed.
Is this a breaking change?
No
Does this PR require an update to the documentation?
Not yet.
Checklist: