czyrym

czyrym is a Python library for detecting whether two Polish words rhyme. It relies on a heuristic held together with string: it takes the part of each word from the second-to-last vowel (the last vowel for one-syllable words) to the end, applies various mutations that I came up with by hand (they operate on letters, not on sounds; for example, one changes "kł$" to "k"), and then checks whether any of the mutations makes the result identical for both words.

I tested the library on my corpus of Polish rhymes from https://czterycztery.pl/programy/slownik_rymow. The corpus contains 8100 rhymes, and czyrym detects 7212 of them (about 89%). The false-positive rate is quite low: of the 37177 non-rhymes in the corpus, czyrym incorrectly reports only 250 (about 0.7%) as rhymes, and some of those are arguably real rhymes, just imperfect ones. When czyrym detects a pair of words as a rhyme, it also reports the distance between the two words as a number between 0 (a perfect rhyme) and a very large number (hardly a rhyme at all). If you want czyrym to be more conservative, you can compare this distance with a threshold of your choice.
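The heuristic described above can be illustrated with a minimal sketch. This is not czyrym's actual code: the names `rhyme_suffix`, `reachable_forms`, `rhyme_distance` and `MUTATIONS` are hypothetical, the two rules are modelled on the mutators that appear in the example output in the usage section, and summing the costs of both words' paths is an assumption. Treating every vowel letter as a separate syllable is also a simplification.

```python
from typing import Dict, Optional

POLISH_VOWELS = set("AĄEĘIOÓUY")

# (pattern, replacement, cost): two illustrative letter-level rules only.
# These are assumptions for the sketch, not the library's real rule set.
MUTATIONS = [("Ó", "U", 0.0), ("Ż", "SZ", 1.0)]


def rhyme_suffix(word: str) -> str:
    """Return the letters from the second-to-last vowel (or the only vowel) to the end."""
    letters = "".join(ch for ch in word.upper() if ch.isalpha())
    vowel_positions = [i for i, ch in enumerate(letters) if ch in POLISH_VOWELS]
    if not vowel_positions:
        return letters  # no vowel at all: fall back to the whole word
    start = vowel_positions[-2] if len(vowel_positions) >= 2 else vowel_positions[-1]
    return letters[start:]


def reachable_forms(suffix: str) -> Dict[str, float]:
    """Map every form reachable by applying the rules to its cheapest total cost."""
    best = {suffix: 0.0}
    frontier = [suffix]
    while frontier:
        current = frontier.pop()
        for old, new, cost in MUTATIONS:
            if old in current:
                mutated = current.replace(old, new)
                total = best[current] + cost
                if total < best.get(mutated, float("inf")):
                    best[mutated] = total
                    frontier.append(mutated)
    return best


def rhyme_distance(first: str, second: str) -> Optional[float]:
    """Return the cheapest combined cost of reaching a common form, or None if there is none."""
    a = reachable_forms(rhyme_suffix(first))
    b = reachable_forms(rhyme_suffix(second))
    common = a.keys() & b.keys()
    if not common:
        return None
    return min(a[form] + b[form] for form in common)


print(rhyme_suffix("RÓŻ TY"))               # ÓŻTY
print(rhyme_distance("RÓŻ TY", "RUSZTY"))   # 1.0
print(rhyme_distance("KOT", "PIES"))        # None
```

With only two rules the search is trivial; a larger rule set would need the same cheapest-cost bookkeeping so that the reported distance is the cost of the best path rather than the first one found.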

download, install

You can install the library with pip install czyrym. If you would like to tinker with it, fork it, or improve it, you can download the sources from here.

usage

Here is a usage example:

import czyrym
from typing import Optional
first: str = czyrym.normalize_word('Róż, ty???')
second: str = czyrym.normalize_word('ruszty')

# if you just want to know if two words rhyme or not, do:
if czyrym.is_rhyme(first, second):
    print(f'"{first}" and "{second}" rhyme')
else:
    print(f'"{first}" and "{second}" do not rhyme')
    
# if you want to know if two words rhyme or not and what is the distance between them, do:
match: Optional[czyrym.RhymeMatch] = czyrym.find_rhyme_match(first, second)
if match is None:
    print(f'"{first}" and "{second}" don\'t rhyme')
else:
    print(f'A match for "{first}" and "{second}" was found - they rhyme. Total cost of the rhyme (a distance between words) is {match.total_cost}.')

# if you want to understand how it was decided that two words rhyme, do:
if match is None:
    print(f'a match for "{first}" and "{second}" was not found - they don\'t rhyme')
else:
    print(f'A match for "{first}" and "{second}" was found - they rhyme. Their common suffix is {match.common_form}.')
    for word, path in (first, match.first_path), (second, match.second_path):
        if len(path.steps) <= 1:
            print(f'For "{word}" the suffix was not mutated.')
        else:
            print(f'For "{word}" the suffix was produced with these steps:')
            for step in path.steps:
                print(f'  "{step.before}" -> "{step.after}" (mutator: {step.mutator_name}, cost: {step.cost})')
            print(f'Total cost of these steps was {path.cost}.')

Here is the output of running this example:

 
$ python3 example.py
"RÓŻ TY" and "RUSZTY" rhyme
A match for "RÓŻ TY" and "RUSZTY" was found - they rhyme. Total cost of the rhyme (a distance between words) is 1.0.
A match for "RÓŻ TY" and "RUSZTY" was found - they rhyme. Their common suffix is USZTY.
For "RÓŻ TY" the suffix was produced with these steps:
  "ÓŻTY" -> "UŻTY" (mutator: simple_Ó_to_U, cost: 0.0)
  "UŻTY" -> "USZTY" (mutator: z_dot_to_sz, cost: 1.0)
Total cost of these steps was 1.0.
For "RUSZTY" the suffix was not mutated.
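To make czyrym more conservative, the reported distance can be compared with a threshold. The sketch below shows one possible way to do that. The helper `rhymes_within` and the threshold values are hypothetical and not part of the library; it only assumes that the match finder returns None or an object with a `total_cost` attribute, as `czyrym.find_rhyme_match` does in the example above. A stand-in finder is used so that the sketch runs without czyrym installed.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional

# Any callable shaped like czyrym.find_rhyme_match: (first, second) -> match or None.
MatchFinder = Callable[[str, str], Optional[Any]]


def rhymes_within(first: str, second: str, max_cost: float, find_match: MatchFinder) -> bool:
    """Accept a pair only if a match exists and its total cost does not exceed max_cost."""
    match = find_match(first, second)
    return match is not None and match.total_cost <= max_cost


# With the library installed, one would pass the real function instead:
#   rhymes_within(first, second, 0.5, czyrym.find_rhyme_match)
def fake_finder(first: str, second: str) -> Optional[Any]:
    """Stand-in for illustration: always reports a match with cost 1.0."""
    return SimpleNamespace(total_cost=1.0)


print(rhymes_within("RÓŻ TY", "RUSZTY", 0.5, fake_finder))  # False
print(rhymes_within("RÓŻ TY", "RUSZTY", 1.0, fake_finder))  # True
```

A suitable threshold depends on how imperfect a rhyme you are willing to accept, so it is best chosen by trying a few values on your own word pairs.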