https://bugzilla.redhat.com/show_bug.cgi?id=2319926
Bug ID: 2319926
Summary: Review-request: python-html-text - Extract text from HTML
Product: Fedora
Version: rawhide
OS: Linux
Status: NEW
Component: Package Review
Severity: medium
Assignee: nobody@fedoraproject.org
Reporter: benson_muite@emailplus.org
QA Contact: extras-qa@fedoraproject.org
CC: package-review@lists.fedoraproject.org
Target Milestone: ---
Classification: Fedora
spec: https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-ra...
srpm: https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-ra...
description: How is html_text different from .xpath('//text()') from lxml or .get_text() from Beautiful Soup?
- Text extracted with html_text does not contain inline styles, JavaScript, comments and other text that is not normally visible to users;
- html_text normalizes whitespace, but in a smarter way than .xpath('normalize-space()'), adding spaces around inline elements (which are often used as block elements in HTML markup) and trying to avoid adding extra spaces around punctuation;
- html_text can add newlines (e.g. after headers or paragraphs), so that the output text looks more like how it is rendered in browsers.
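The three points above can be illustrated with a minimal standard-library sketch. This is not html_text's implementation and does not call html_text (whose entry point is html_text.extract_text); the names VisibleTextParser, visible_text, INVISIBLE_TAGS and BLOCK_TAGS are hypothetical and exist only for this illustration. It only shows, in simplified form, what skipping invisible content, normalizing whitespace and adding newlines after block elements mean, and it makes no attempt at the punctuation handling described above.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this illustration only;
# html_text uses its own, more complete rules.
INVISIBLE_TAGS = {"script", "style", "head", "noscript"}
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"}


class VisibleTextParser(HTMLParser):
    """Collect only text that a browser would normally display."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside an invisible element

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.hidden_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # line break before a block element

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # line break after a block element

    def handle_data(self, data):
        if self.hidden_depth == 0:
            # Collapse any whitespace run inside a text node to one space.
            self.chunks.append(re.sub(r"\s+", " ", data))

    # HTML comments are dropped: the default handle_comment() does nothing.


def visible_text(html):
    """Return visible text, one line per block element (simplified)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).split("\n")
    cleaned = [re.sub(r" +", " ", line).strip() for line in lines]
    return "\n".join(line for line in cleaned if line)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><!-- hidden --><p>Hello   <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
# Title
# Hello world!
```

By contrast, a plain //text() query would also return the contents of the style and script elements and would keep the original whitespace.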
fas: fed500
Comments: The pytest7 warning seems spurious, as pytest7 is not installed.
Reproducible: Always