Poppler: Displaying PDF Files with Qt
From Wiki.crossplatform.ru
Qt Quarterly | Issue 27
by David Boddie
As we have seen previously, Qt can be used to generate documents in an ever-widening range of formats, which can be viewed and edited by external applications. Qt comes with facilities for displaying HTML "out of the box" and can create its own "print previews", but what about other file formats that originate outside Qt applications?
Fortunately, third-party libraries exist for some of the things that Qt does not provide. One of these is Poppler, a Portable Document Format (PDF) rendering library that underpins a number of widely used PDF viewers. Poppler is a fork of the Xpdf PDF viewer, which is licensed under the GNU General Public License. Xpdf is also available under other licensing terms.
Poppler is designed so that it can be used with any toolkit or framework, as long as a suitable rendering backend is available. Developers of Qt applications are fortunate that a Qt frontend is also available: a set of Qt-style classes that use Qt classes to describe parts of PDF documents.
In this article we take a brief look at some of the features Poppler provides, in the context of building a simple PDF viewer application.
Setting Things Up
Developers using Linux should find that Poppler and its Qt 4 frontend are available as packages for most recent distributions. Developers on Windows, Mac OS X and other Unix platforms can download the source code from poppler.freedesktop.org.
By default, Poppler is built with all kinds of frontends and backends. If you are compiling Poppler from source, you can exclude some of these to save compilation time. When configuring the build, it may be easiest to use the same installation prefix that was used to install Qt; this prefix is the directory under which the subdirectories containing executables, libraries and data files are stored.
It is important to know where the Poppler library and header files will be installed, because our example needs them.
Rendering Documents
In our example, we provide a simple user interface for displaying PDF files, showing one page at a time and providing controls that let the user move between pages. Each page is displayed in a custom widget, DocumentWidget, which is placed in the main window's central widget, a scroll area.
The user opens a new file via a file dialog, which we open in response to an action being triggered. The path to the file is passed to the DocumentWidget so that the document it contains can be fed to the Poppler library.
Unlike with many Qt classes, we load a document using a static function in the following way:
Poppler::Document *doc = Poppler::Document::load(path);
If the document returned is not null, we have a document that we can explore. Note that our example takes ownership of the document, so we must remember to dispose of it when we have finished with it.
Each document contains a series of pages that can be obtained one by one using the Document::page() function. Although the Document class has a collection of functions to control the appearance of the document, actual rendering is performed by each Page object. In our example, we render pages into QImage objects that we display using the DocumentWidget, itself just a simple QLabel subclass.
The key part of our DocumentWidget::showPage() function looks like this:
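void DocumentWidget::showPage(int page)
{
    // Note: in the Qt 4 frontend, page() returns a new Page object that the
    // caller owns; a complete implementation would delete it after rendering.
    QImage image = doc->page(currentPage)->renderToImage(
        scaleFactor * physicalDpiX(),
        scaleFactor * physicalDpiY());
    ...
    setPixmap(QPixmap::fromImage(image));
}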
In the above code we pass the resolution of the image to be created, multiplied by a scale factor that the user controls via the example's user interface. We have to be careful with the range of scale factors available because it is easy to request extremely large images. In practice, we restrict the user's choice to a set of predefined scale factors.
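One way to keep image sizes bounded is to snap any requested scale to the nearest entry in a fixed list, and to estimate the pixel dimensions before rendering. The sketch below is plain C++ and does not call Poppler or Qt; the list of factors and the names kScaleFactors, snapScaleFactor and renderedPixels are illustrative assumptions, not part of the example's source.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical set of zoom levels offered to the user.
constexpr std::array<double, 6> kScaleFactors = {0.25, 0.5, 0.75, 1.0, 1.5, 2.0};

// Return the predefined factor closest to the requested one.
double snapScaleFactor(double requested)
{
    double best = kScaleFactors[0];
    for (double candidate : kScaleFactors) {
        if (std::fabs(candidate - requested) < std::fabs(best - requested))
            best = candidate;
    }
    return best;
}

// Number of pixels along one axis for a page dimension given in points
// (72 points per inch), rendered at a resolution of scaleFactor * dpi.
long renderedPixels(double lengthInPoints, double dpi, double scaleFactor)
{
    assert(dpi > 0.0 && scaleFactor > 0.0);
    return std::lround(lengthInPoints / 72.0 * dpi * scaleFactor);
}
```

For instance, a US Letter page (612 by 792 points) on a 96 dpi display at a scale factor of 2.0 comes to 1632 by 2112 pixels, which shows how quickly the image grows with the scale factor.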
Searching for Text
One of the many useful features Poppler provides is the ability to locate specific text strings in PDF documents. Because PDF is designed to hold printable rather than editable documents, it is not always easy to access and recover the author's original text. However, Poppler does a good job of finding text in many documents, and we can use this capability in our example.
The text search API offers the usual features, such as case-insensitive and directional searching, and also returns information about the position of any located text on the page. Since PDF is a display format, this is really the only useful information about the text that we can obtain. This information can be used to specify where a subsequent search should start.
In essence, the code to perform a forward search on a given page looks like this:
bool found = page->search(text, searchLocation,
                          Poppler::Page::NextResult,
                          Poppler::Page::CaseInsensitive);
Here, searchLocation is a QRectF object that indicates where the search should start from on the given page. Initially, when we perform a search, we just pass a default constructed QRectF object to start from the page origin.
The rectangle we obtain from the Page::search() function can be used when we render the page to highlight the located text and scroll the view to make sure it is visible. However, the position and dimensions of the rectangle are given in points (1 inch = 72 points), so we need to transform the rectangle to cover the correct area on-screen.
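The conversion from points to pixels is a plain scaling by resolution / 72 along each axis. The following sketch shows the arithmetic in standard C++; RectF is a minimal stand-in for QRectF, and pointsToPixels is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>

// Minimal stand-in for QRectF: position and size as doubles.
struct RectF {
    double x, y, width, height;
};

// Scale a rectangle given in PDF points (72 per inch) to device pixels
// for a page rendered at a resolution of scaleFactor * dpi on each axis.
RectF pointsToPixels(const RectF &r, double scaleFactor, double dpiX, double dpiY)
{
    assert(scaleFactor > 0.0 && dpiX > 0.0 && dpiY > 0.0);
    const double sx = scaleFactor * dpiX / 72.0;
    const double sy = scaleFactor * dpiY / 72.0;
    return RectF{r.x * sx, r.y * sy, r.width * sx, r.height * sy};
}
```

With a scale factor of 1.5 on a 144 dpi display, each axis is multiplied by 3, so a rectangle at (10, 20) measuring 30 by 40 points covers the pixels from (30, 60) with a size of 90 by 120.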
Searching through a document for a piece of text is slightly more involved than just a single function call. We'll look at this in more detail later.
Extracting Text
Since the mapping between the author's original text and its location on-screen may be purely visual, it is difficult to automate the extraction of text from PDF files, though there are tools that try very hard to achieve this.
Many document viewers let the user select and export text by selecting a region on-screen, which gives the application something to work with. Poppler supports this approach by providing a function that returns a string for a given rectangle, which we call like this:
QString text = doc->page(currentPage)->text(selectedRect);
The method we use is somewhat different to this. We'll cover it in more detail later.
The Example in Detail
Having covered the basics of displaying pages, searching, and extracting text from documents, let's take a closer look at how our example uses these features.
We provide two functions to search for text strings supplied by the user via the user interface. For forwards searching, we start by looking for strings on the current page, beginning at the current search location, then try each following page until the end of the document.
QRectF DocumentWidget::searchForwards(const QString &text)
{
    int page = currentPage;
    while (page < doc->numPages()) {

        if (doc->page(page)->search(text, searchLocation,
            Poppler::Page::NextResult, Poppler::Page::CaseInsensitive)) {
            if (!searchLocation.isNull()) {
                showPage(page + 1);
                return searchLocation;
            }
        }
        page += 1;
        searchLocation = QRectF();
    }
If we reach the end of the document without finding anything, we search from the beginning until we reach the current page.
    page = 0;

    while (page < currentPage) {

        searchLocation = QRectF();

        if (doc->page(page)->search(text, searchLocation,
            Poppler::Page::NextResult, Poppler::Page::CaseInsensitive)) {
            if (!searchLocation.isNull()) {
                showPage(page + 1);
                return searchLocation;
            }
        }
        page += 1;
    }

    return QRectF();
}
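The order in which pages are visited (from the current page to the end, then from the first page up to the one before the current page) can also be written as a single loop with a modular index. The sketch below shows only that ordering, using standard C++ and a deliberately simplified model in which each page is just a string; findPageWrapping is a hypothetical name, and the position tracking within a page that the real function performs is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified model: each page is represented only by its text.
// Returns the zero-based index of the first page containing `text`,
// visiting pages currentPage..last and then 0..currentPage-1,
// or -1 if no page matches.
int findPageWrapping(const std::vector<std::string> &pages,
                     int currentPage, const std::string &text)
{
    const int count = static_cast<int>(pages.size());
    assert(count == 0 || (currentPage >= 0 && currentPage < count));

    for (int step = 0; step < count; ++step) {
        const int page = (currentPage + step) % count;
        if (pages[page].find(text) != std::string::npos)
            return page;
    }
    return -1;
}
```

Each page is visited exactly once, so a search for text that does not occur anywhere ends after one pass through the document.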
As well as rendering pages at different scales, as shown earlier, we would like to highlight the results of searches. To do this, we insert some code to paint on the image obtained from the current page, using a matrix to map the rectangle onto the image.
QMatrix DocumentWidget::matrix() const
{
    return QMatrix(scaleFactor * physicalDpiX() / 72.0, 0,
                   0, scaleFactor * physicalDpiY() / 72.0,
                   0, 0);
}

void DocumentWidget::showPage(int page)
{
    ...
    QImage image = doc->page(currentPage)->renderToImage(
        scaleFactor * physicalDpiX(),
        scaleFactor * physicalDpiY());

    if (!searchLocation.isEmpty()) {
        QRect highlightRect = matrix().mapRect(searchLocation).toRect();
        highlightRect.adjust(-2, -2, 2, 2);
        QImage highlight = image.copy(highlightRect);
        QPainter painter;
        painter.begin(&image);
        painter.fillRect(image.rect(), QColor(0, 0, 0, 32));
        painter.drawImage(highlightRect, highlight);
        painter.end();
    }

    setPixmap(QPixmap::fromImage(image));
}
The result of this additional effort is shown in the following image—the located text is displayed normally while the rest of the page is slightly darker.
In our example, we allow the user to draw a selection onto the page by reimplementing three of the mouse event handler functions in our DocumentWidget. In these we maintain a QRubberBand object to keep track of the area selected, following the pattern shown in the QRubberBand documentation.
The mouse release event handler is where we start the process of selecting text:
void DocumentWidget::mouseReleaseEvent(QMouseEvent *)
{
    ...
    if (!rubberBand->size().isEmpty()) {
        QRectF rect = QRectF(rubberBand->pos(), rubberBand->size());
        rect.moveLeft(rect.left() - (width() - pixmap()->width()) / 2.0);
        rect.moveTop(rect.top() - (height() - pixmap()->height()) / 2.0);
        selectedText(rect);
    }
    rubberBand->hide();
}
When the user releases the mouse button, we create a rectangle with coordinates relative to the top-left corner of the image within the label, and we pass this to the selectedText() function which is responsible for informing the rest of the application about any text it finds.
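The offsets subtracted in the handler imply that the label centres its pixmap, so the image's top-left corner sits at half of the spare space in each direction. The sketch below isolates that arithmetic in standard C++, under that centring assumption; PointF stands in for QPointF and widgetToImage is a hypothetical helper name.

```cpp
#include <cassert>

// Minimal stand-in for QPointF.
struct PointF {
    double x, y;
};

// Convert a position in widget coordinates to coordinates relative to the
// top-left corner of a pixmap centred inside the widget.
PointF widgetToImage(PointF widgetPos,
                     double widgetW, double widgetH,
                     double pixmapW, double pixmapH)
{
    // Assumes the pixmap is no larger than the widget that shows it.
    assert(pixmapW <= widgetW && pixmapH <= widgetH);
    return PointF{widgetPos.x - (widgetW - pixmapW) / 2.0,
                  widgetPos.y - (widgetH - pixmapH) / 2.0};
}
```

For a 400 by 300 label showing a 200 by 100 pixmap, the image starts at (100, 100) in widget coordinates, so a click at (150, 120) corresponds to (50, 20) on the image.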
As noted earlier, the Poppler Page class provides a function to return text within a rectangle in a document. However, in selectedText(), we use a more convoluted method to show how much information we can obtain about a document.
We begin by mapping the selection rectangle onto the page, using the inverse of the matrix we used to highlight search results, before obtaining a list of TextBox objects, each of which describes a piece of text on the page.
void DocumentWidget::selectedText(const QRectF &rect)
{
    QRectF selectedRect = matrix().inverted().mapRect(rect);

    QString text;
    bool hadSpace = false;
    QPointF center;
    foreach (Poppler::TextBox *box, doc->page(currentPage)->textList()) {
        if (selectedRect.intersects(box->boundingBox())) {
            if (hadSpace)
                text += " ";
            if (!text.isEmpty() && box->boundingBox().top() > center.y())
                text += "\n";
            text += box->text();
            hadSpace = box->hasSpaceAfter();
            center = box->boundingBox().center();
        }
    }

    if (!text.isEmpty())
        emit textSelected(text);
}
We test whether each piece of text lies within the selection and append it in a QString if it does. We also perform some elementary checks to see if we can cleverly insert newline characters in appropriate places.
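To see the newline heuristic in isolation, the sketch below reproduces the loop's logic in standard C++ without Poppler or Qt. The Box struct is a hypothetical stand-in for the few TextBox properties the loop reads, and joinBoxes is an illustrative name; the selection test is omitted, so every box passed in is treated as selected.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the parts of a text box that the heuristic uses.
struct Box {
    std::string text;
    double top;       // top edge of the bounding box
    double bottom;    // bottom edge of the bounding box
    bool spaceAfter;  // whether a space follows this piece of text
};

// Join pieces of text, starting a new line whenever a box begins below
// the vertical centre of the previous box.
std::string joinBoxes(const std::vector<Box> &boxes)
{
    std::string text;
    bool hadSpace = false;
    double centerY = 0.0;
    for (const Box &box : boxes) {
        assert(box.bottom >= box.top);
        if (hadSpace)
            text += " ";
        if (!text.empty() && box.top > centerY)
            text += "\n";
        text += box.text;
        hadSpace = box.spaceAfter;
        centerY = (box.top + box.bottom) / 2.0;
    }
    return text;
}
```

As in the loop it mirrors, a piece of text at the end of a line that is flagged as being followed by a space leaves a trailing space before the newline; an application that cares about this could trim each line afterwards.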
Note that, while we're satisfied with obtaining whole pieces of text (typically words in a sentence), recent versions of Poppler allow the individual characters in TextBox objects to be located.
In the user interface, when the user selects some text, we display it in a text browser so that it can be copied and pasted elsewhere.
Building the Example
The example is provided as a standard Qt project with a simple pdfviewer.pro file. Because there is a certain amount of freedom associated with where you can install the Poppler library and header files on your system, you will need to modify this file to use the correct paths.
On Ubuntu 8.04 with the libpoppler-qt4-dev package installed, the appropriate paths are as follows:
INCLUDEPATH += /usr/include/poppler/qt4
LIBS += -L/usr/lib -lpoppler-qt4
Other Linux distributions may install these files in different locations, and developers on other platforms may find it easier to build the library alongside the example instead of installing it.
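Where pkg-config is installed and Poppler's Qt 4 frontend has registered a poppler-qt4 package with it, qmake can look the paths up instead of having them hard-coded. The fragment below is a sketch under that assumption and is not taken from the example's project file:

```
# Ask qmake to query pkg-config for compiler and linker flags.
CONFIG += link_pkgconfig
PKGCONFIG += poppler-qt4
```

If pkg-config cannot find the package, the explicit INCLUDEPATH and LIBS entries remain the fallback.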
Other Features and Improvements
Our PDF viewer example only uses the most basic features of the Poppler library. Since many documents use features like encryption, slideshow transitions, tables of contents and annotations, the viewer applications that use Poppler to render documents rely on the library's support for these features.
Poppler includes a number of low level features that are useful for the purpose of analysing PDF files. Access to the list of fonts used in a document and the font data itself can be useful when preparing documents for publication.
Access to the body of text in a document is useful to developers looking to index documents for text mining and subsequent analysis. However, as noted earlier, this might be of limited use for some documents. A good summary of the issues surrounding text extraction can be found on the following page:
http://www.glyphandcog.com/textext.html
Information that is not part of the visible document is also available via the Poppler API. Annotations, scripts (typically written in JavaScript) and the URLs for hyperlinks can all be obtained, though it is up to the application developer to present this information in a meaningful way.
Like Qt's QPrinter class, Poppler is also able to write PostScript files, so we could easily add support for file export and conversion. Recent versions also support PDF output, and this opens the door to the use of the library for PDF manipulation. In fact, since the library allows us to examine documents without having to display pages, it is possible to write command line tools to handle documents, and a number of these are supplied with Poppler.
See Also
Poppler is hosted on freedesktop.org, a site dedicated to Free and Open Source desktop projects:
http://poppler.freedesktop.org/
Poppler's Qt 4 frontend has its own documentation, which can be obtained via the project's Wiki:
http://freedesktop.org/wiki/Software/poppler
Popular PDF viewers which use Poppler include Okular and Evince for the KDE and GNOME desktop environments:
http://www.gnome.org/projects/evince/
The Xpdf application, from which Poppler is derived, can be obtained from the following Web site:
The source code for the example described in this article can be obtained from the Qt Quarterly Web site.