Ruby/Network/hpricot HTML Parsing

From Wiki.crossplatform.ru

Hpricot can work directly with open-uri to load HTML from remote files

require "rubygems"
require "hpricot"
require "open-uri"
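# open-uri lets open() fetch a URL (on Ruby 3.0 and later, call URI.open instead)
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
# search returns every match; print the contents of the first <h1>
puts doc.search("h1").first.inner_html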



Hpricot can also parse HTML held in a string, such as a here document

require "rubygems"
require "hpricot"
html = <<END_OF_HTML
<html>
<head>
  <title>This is the page title</title>
</head>
<body>
  <h1>Big heading!</h1>
  <p>A paragraph of text.</p>
  <ul><li>Item 1 in a list</li><li>Item 2</li><li class="highlighted">Item 3</li></ul>
</body>
</html>
END_OF_HTML
doc = Hpricot(html)
puts doc.search("h1").first.inner_html



Use at to search for the first instance of an element only

require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
 
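# at returns only the first matching element (or nil if there is none)
list = doc.at("ul")
# search within that element for its list items
list.search("li").each do |item|
  puts item.inner_html
end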
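
Both at and search(...).first return nil when nothing matches, so calling inner_html on the result raises NoMethodError on a page that lacks the element. A minimal sketch of a guard follows. The name inner_html_or is a hypothetical helper, not part of Hpricot, and it assumes only that the node responds to at.

```ruby
# Hypothetical helper (not part of Hpricot): returns the inner HTML of the
# first element under +node+ that matches +selector+, or +default+ when
# nothing matches. It relies only on node.at returning nil on no match.
def inner_html_or(node, selector, default = "")
  element = node.at(selector)
  element ? element.inner_html : default
end
```

With a parsed document in doc, a call such as puts inner_html_or(doc, "h1", "(no heading)") would print the fallback text instead of failing when the page has no h1.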



Combining search methods: find the list within the HTML, then extract each item

require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
 
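# search returns all <ul> elements; first picks the first one
list = doc.search("ul").first
# a second search, scoped to that list, yields its items
list.search("li").each do |item|
  puts item.inner_html
end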
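
When the items are needed as data rather than printed one by one, the same scoped search can be collected into an array. This is a sketch under assumptions: inner_htmls is a hypothetical helper name, not part of Hpricot, and it assumes the result of search behaves like an array of elements that respond to inner_html.

```ruby
# Hypothetical helper (not part of Hpricot): collects the inner HTML of every
# element under +node+ that matches +selector+, collapsing runs of whitespace
# (including line breaks) into single spaces.
def inner_htmls(node, selector)
  node.search(selector).map do |element|
    element.inner_html.gsub(/\s+/, " ").strip
  end
end
```

With the list from the example above, items = inner_htmls(list, "li") would give one string per list item, which can then be sorted, filtered, or joined as needed.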



Using CSS classes to find certain elements

require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
list = doc.at("ul")
# ".highlighted" is a CSS class selector; here it picks the list item with class="highlighted"
highlighted_item = list.at("/.highlighted")
puts highlighted_item.inner_html