Web Scraping Using Selenium with Node.js

Posted By : Parveen Kumar Yadav | 27-Jun-2017

For web scraping you can use PhantomJS, Nightmare.js, and similar tools. In some cases, however, the server detects that the request comes from a bot rather than a real user. Selenium can sometimes get around this; it does not work in every case, but it is one more option for scraping. Selenium is a web testing framework that automatically loads a web browser to mimic a normal user. Once a page loads, you can scrape its content. To use Selenium in your project, follow these steps:

npm install selenium-standalone@latest -g
selenium-standalone install
selenium-standalone start

 

You can find the documentation for this package at:

https://www.npmjs.com/package/selenium-standalone
 

After that, install the Selenium WebDriver package:

npm install selenium-webdriver
 

For a detailed description of installation and usage, see:

https://www.npmjs.com/package/selenium-webdriver
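With the standalone server started and selenium-webdriver installed, a script can open a Firefox session through that server. The following is only a rough sketch: it assumes the server listens on localhost:4444 under /wd/hub (please check the selenium-standalone output for the real address), and hubUrl and openSession are hypothetical helper names, not part of either package.

```javascript
// Sketch: open a Firefox session through the standalone Selenium server.
// ASSUMPTION: the server listens on localhost:4444 under /wd/hub.

// Hypothetical helper: build the server URL from a host and a port.
function hubUrl(host = 'localhost', port = 4444) {
  return `http://${host}:${port}/wd/hub`;
}

// Hypothetical helper: create a Firefox driver, optionally via a remote server.
async function openSession(serverUrl) {
  // Loaded lazily so hubUrl() can be used even before the package is installed.
  const { Builder } = require('selenium-webdriver');
  let builder = new Builder().forBrowser('firefox');
  if (serverUrl) {
    // Send WebDriver commands to the given server instead of a local driver.
    builder = builder.usingServer(serverUrl);
  }
  return builder.build();
}
```

Calling openSession(hubUrl()) should resolve with a driver object; call driver.quit() when you are done so the browser window is closed. Calling openSession() without a URL lets selenium-webdriver start Firefox locally instead.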

 

If you get the following error:

Error: The geckodriver executable could not be found on the current PATH. Please download the latest version from https://github.com/mozilla/geckodriver/releases/WebDriver and ensure it can be found on your PATH.

 

then you need to download the latest version of geckodriver, or first check that it is on your PATH. If you are using Ubuntu, you can install geckodriver by following the instructions at:

https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-Ubuntu
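To check the PATH from Node.js itself, something like the following may help. It is a minimal sketch using only built-in modules; findOnPath is a hypothetical helper name, and the ".exe" suffix on Windows is an assumption about how the binary is named there.

```javascript
const fs = require('fs');
const path = require('path');

// Return the full path of `name` from the first PATH directory that
// contains it as an executable file, or null if no directory does.
function findOnPath(name, pathVar = process.env.PATH || '') {
  // ASSUMPTION: on Windows the binary carries an .exe suffix.
  const fileName = process.platform === 'win32' ? `${name}.exe` : name;
  for (const dir of pathVar.split(path.delimiter)) {
    if (!dir) continue; // skip empty PATH entries
    const candidate = path.join(dir, fileName);
    try {
      fs.accessSync(candidate, fs.constants.X_OK);
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch (err) {
      // Missing or not executable in this directory; keep looking.
    }
  }
  return null;
}

console.log(findOnPath('geckodriver') || 'geckodriver is not on the PATH');
```

If this prints the fallback message, the directory that holds geckodriver still needs to be added to PATH (for example in .bashrc) before the error goes away.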
 

After that, you also need to install a compatible Firefox version, which you can download via the following link:

https://askubuntu.com/questions/661186/how-to-install-previous-firefox-version --> install any version Firefox
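Before picking a Firefox build, it can help to see which versions are installed right now. The sketch below runs each tool with --version and pulls out the version number. It assumes both commands are named firefox and geckodriver and print something like "Mozilla Firefox 51.0.1" and "geckodriver 0.16.1"; parseVersion and installedVersion are hypothetical helper names.

```javascript
const { execFileSync } = require('child_process');

// Pull the first dotted version number (e.g. "51.0.1") out of a string.
function parseVersion(text) {
  const match = /\d+(?:\.\d+)+/.exec(text);
  return match ? match[0] : null;
}

// Run `<command> --version` and return the parsed version,
// or null if the command is missing or fails.
function installedVersion(command) {
  try {
    const output = execFileSync(command, ['--version'], {
      encoding: 'utf8',
      stdio: ['ignore', 'pipe', 'ignore'], // capture stdout only
      timeout: 5000, // give up if the command does not return
    });
    return parseVersion(output);
  } catch (err) {
    return null;
  }
}

console.log('firefox:', installedVersion('firefox'));
console.log('geckodriver:', installedVersion('geckodriver'));
```

A null in the output means the command could not be run, which usually points back to the PATH problem rather than a version mismatch.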

 

This issue comes from a mismatch between the Firefox version and the geckodriver version. I upgraded Firefox to the stable version 51.0.1, upgraded geckodriver to 0.16.1, and set the PATH again in .bashrc; after that the issue was resolved. If everything works, you can get the HTML content of any web page with the driver's getPageSource() method:

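const { Builder } = require('selenium-webdriver');
// Start Firefox; geckodriver must be on the PATH.
const driver = new Builder().forBrowser('firefox').build();
driver.get('http://example.com')
  .then(() => driver.getPageSource()) // resolves with the page HTML
  .then((html) => { console.log(html); return driver.quit(); });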
 

In this way you can get the source of a page using Selenium WebDriver in Node.js.
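Once you have the page source, the scraping itself is just string processing. As a small illustration, the sketch below pulls the page title out of the HTML. The names extractTitle and scrapeTitle are hypothetical, and the regular expression is only good enough for this one tag; for anything larger an HTML parser is probably the safer choice.

```javascript
// Pull the text of the <title> element out of an HTML string.
// Returns null when the page has no <title>.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Load a page in Firefox and return its title.
// Needs selenium-webdriver installed and geckodriver on the PATH.
async function scrapeTitle(url) {
  const { Builder } = require('selenium-webdriver'); // loaded lazily
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get(url);
    return extractTitle(await driver.getPageSource());
  } finally {
    // Always close the browser, even if loading the page failed.
    await driver.quit();
  }
}
```

You could call it as scrapeTitle('http://example.com').then(console.log) once the driver setup above is working.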

Hope this will help. Thanks!
