Ruby/Network/hpricot HTML Parsing
From Wiki.crossplatform.ru
Hpricot can work directly with open-uri to load HTML from remote files
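require "rubygems"
require "hpricot"
require "open-uri"
# Fetch the page and parse it (on Ruby 3.0 and later, open-uri needs URI.open here).
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
# Print the contents of the first <h1> element.
puts doc.search("h1").first.inner_html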
Hpricot can also parse HTML held in a string
require "rubygems"
require "hpricot"
html = <<END_OF_HTML
<html>
<head>
<title>This is the page title</title>
</head>
<body>
<h1>Big heading!</h1>
<p>A paragraph of text.</p>
<ul><li>Item 1 in a list</li><li>Item 2</li><li class="highlighted">Item 3</li></ul>
</body>
</html>
END_OF_HTML
doc = Hpricot(html)
puts doc.search("h1").first.inner_html
Search for the first instance of an element only
require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
list = doc.at("ul")
list.search("li").each do |item|
  puts item.inner_html
end
Combining search methods: find the list within the HTML, then extract each item
require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
list = doc.search("ul").first
list.search("li").each do |item|
  puts item.inner_html
end
Using CSS classes to find certain elements
require "rubygems"
require "hpricot"
require "open-uri"
doc = Hpricot(open("http://www.rubyinside.ru/test.html"))
puts doc.search("h1").first.inner_html
list = doc.at("ul")
highlighted_item = list.at("/.highlighted")
puts highlighted_item.inner_html