[Bug 2319926] New: Review-request: python-html-text - Extract text from HTML

https://bugzilla.redhat.com/show_bug.cgi?id=2319926

            Bug ID: 2319926
           Summary: Review-request: python-html-text - Extract text from
                    HTML
           Product: Fedora
           Version: rawhide
                OS: Linux
            Status: NEW
         Component: Package Review
          Severity: medium
          Assignee: nobody@xxxxxxxxxxxxxxxxx
          Reporter: benson_muite@xxxxxxxxxxxxx
        QA Contact: extras-qa@xxxxxxxxxxxxxxxxx
                CC: package-review@xxxxxxxxxxxxxxxxxxxxxxx
  Target Milestone: ---
    Classification: Fedora



spec:
https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text.spec
srpm:
https://download.copr.fedorainfracloud.org/results/fed500/gourmand/fedora-rawhide-x86_64/08156160-python-html-text/python-html-text-0.6.2-1.fc42.src.rpm

description:
How is html_text different from .xpath('//text()') from LXML
or .get_text() from Beautiful Soup?

- Text extracted with html_text does not contain inline styles,
javascript, comments and other text that is not normally visible
to users;

- html_text normalizes whitespace, but in a way smarter than
.xpath('normalize-space()'), adding spaces around inline elements
(which are often used as block elements in HTML markup), and trying
to avoid adding extra spaces for punctuation;

- html_text can add newlines (e.g. after headers or paragraphs), so
that the output text looks more like how it is rendered in browsers.
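
To make the points above concrete, here is a minimal standard-library
sketch of the general idea: skip content users do not normally see,
collapse whitespace, and break lines after block-level tags. It is not
html_text's implementation. The names HIDDEN_TAGS, NEWLINE_TAGS,
VisibleTextParser and extract_visible_text are hypothetical, the tag
sets are an illustrative subset, and the sketch does not attempt the
inline-element spacing or punctuation heuristics described above.

```python
from html.parser import HTMLParser

# Assumption: illustrative subset of tags whose content is not shown to users.
HIDDEN_TAGS = {"script", "style", "noscript", "head"}
# Assumption: illustrative subset of tags that end a visual line.
NEWLINE_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class VisibleTextParser(HTMLParser):
    """Collects text outside hidden tags; comments are dropped by default."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # > 0 while inside a hidden tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self._hidden_depth = max(0, self._hidden_depth - 1)
        elif tag in NEWLINE_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = []
        for line in "".join(self._chunks).split("\n"):
            collapsed = " ".join(line.split())
            if collapsed:
                lines.append(collapsed)
        return "\n".join(lines)


def extract_visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><!-- note --><p>Hello   <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_visible_text(sample))
```

For the sample input this prints "Title" and "Hello world!" on separate
lines: the style, script and comment content is gone, the run of spaces
is collapsed, and the header and paragraph each end a line.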

fas: fed500

Comments:
The pytest7 warning seems spurious, as pytest7 is not installed.

Reproducible: Always


