Web Scraping Using Selenium with Node.js
Posted By : Parveen Kumar Yadav | 27-Jun-2017
For web scraping you can use PhantomJS, Nightmare.js, and similar tools. In some cases, though, a server detects that requests made through PhantomJS or Nightmare come from a bot rather than a real user. Selenium can sometimes get around this. It does not work in every case, but it is one more option for scraping. Selenium is a web testing framework that automatically launches a web browser to mimic a normal user. Once a page loads, you can scrape its content. To use Selenium in your project, follow these steps:
npm install selenium-standalone@latest -g
selenium-standalone install
selenium-standalone start
You can check the documentation for this at:
https://www.npmjs.com/package/selenium-standalone
After that, you need to install the Selenium WebDriver package:
npm install selenium-webdriver
For a detailed description of installation and usage, see:
https://www.npmjs.com/package/selenium-webdriver
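With both packages installed, a script can point selenium-webdriver at the running standalone server. The following is a minimal sketch, not the only way to connect. It assumes the server listens on localhost port 4444 under the /wd/hub path, which you should confirm against the output of selenium-standalone start. The helper names serverUrl and openSession are made up for illustration.

```javascript
// Assumption: selenium-standalone listens on localhost:4444 under /wd/hub.
// Adjust host and port if your server reports a different address.
function serverUrl(host = 'localhost', port = 4444) {
  return `http://${host}:${port}/wd/hub`;
}

// Hypothetical helper: open a Firefox session through the standalone server.
async function openSession(url = serverUrl()) {
  // Loaded lazily so this file can be imported without the package installed.
  const { Builder } = require('selenium-webdriver');
  return new Builder().usingServer(url).forBrowser('firefox').build();
}

// Usage (requires the server to be running):
//   const driver = await openSession();
//   await driver.get('http://example.com');
//   await driver.quit();
```

If you skip the standalone server and let selenium-webdriver start geckodriver itself, you can leave out the usingServer call.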
If you get the following error:
Error: The geckodriver executable could not be found on the current PATH. Please download the latest version from https://github.com/mozilla/geckodriver/releases/WebDriver and ensure it can be found on your PATH.
then you need to download the latest version of geckodriver, or first check that it is already on your PATH. If you are using Ubuntu, you can install geckodriver by following this link:
https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-Ubuntu
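Before downloading anything, it can help to confirm whether geckodriver really is missing from your PATH. The sketch below walks the PATH directories looking for an executable file. findOnPath is a hypothetical helper written for this post, not part of Selenium.

```javascript
const fs = require('fs');
const path = require('path');

// Return the full path of the first executable called `name` found in the
// directories listed in `envPath`, or null if there is none.
function findOnPath(name, envPath = process.env.PATH || '') {
  // On Windows the executable usually carries an extension.
  const exts = process.platform === 'win32' ? ['.exe', '.cmd', ''] : [''];
  for (const dir of envPath.split(path.delimiter)) {
    if (!dir) continue;
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext);
      try {
        fs.accessSync(candidate, fs.constants.X_OK);
        if (fs.statSync(candidate).isFile()) return candidate;
      } catch (err) {
        // Not in this directory; keep looking.
      }
    }
  }
  return null;
}

// Usage:
//   console.log(findOnPath('geckodriver') || 'geckodriver is not on PATH');
```

If this returns null after you have installed geckodriver, the directory holding it is probably not in the PATH of the shell that runs your script.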
After that, you also need to install a compatible Firefox version. The following link explains how to install any version of Firefox:
https://askubuntu.com/questions/661186/how-to-install-previous-firefox-version
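To see which Firefox and geckodriver versions are currently installed before choosing a compatible pair, you can ask both tools for their version. This is a rough sketch. It assumes that both commands accept a --version flag and print a dotted version number somewhere in their output. The function names are made up for illustration.

```javascript
const { execFileSync } = require('child_process');

// Pull the first dotted version number (for example "51.0.1") out of some text.
function parseVersion(text) {
  const match = /\d+(?:\.\d+)+/.exec(text);
  return match ? match[0] : null;
}

// Run `<command> --version` and return the parsed version, or null if the
// command is missing, fails, or prints no recognizable version.
function installedVersion(command) {
  try {
    const output = execFileSync(command, ['--version'], {
      encoding: 'utf8',
      stdio: ['ignore', 'pipe', 'ignore'],
    });
    return parseVersion(output);
  } catch (err) {
    return null;
  }
}

// Print both versions side by side.
function reportVersions() {
  console.log('firefox:', installedVersion('firefox'));
  console.log('geckodriver:', installedVersion('geckodriver'));
}
```

Call reportVersions() from a script and compare the two numbers against the compatibility notes in the geckodriver release you downloaded.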
The issue comes from the combination of Firefox and geckodriver versions in use. I upgraded my Firefox browser to the stable version (51.0.1), upgraded geckodriver to 0.16.1, and set the PATH again in .bashrc. After that, the issue was resolved. If everything works, you can get the HTML content of any web page with the driver's getPageSource() method:
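const { Builder } = require('selenium-webdriver');
(async () => {
  // Start a Firefox session (geckodriver must be on your PATH).
  const driver = await new Builder().forBrowser('firefox').build();
  await driver.get('http://example.com');
  const html = await driver.getPageSource(); // full HTML of the loaded page
  await driver.quit();
})();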
This way you can get the source of a page using Selenium WebDriver in Node.js.
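If you only need part of the page, you can also query elements directly instead of parsing the whole source. The sketch below collects the text of every h1 element on a page. The h1 selector and the names cleanTexts and scrapeHeadings are just examples, so swap in whatever matches the page you are scraping.

```javascript
// Trim each string and drop the ones that end up empty.
function cleanTexts(texts) {
  return texts.map((t) => t.trim()).filter(Boolean);
}

// Hypothetical helper: load `url` in Firefox and return the text of its
// <h1> elements.
async function scrapeHeadings(url) {
  // Loaded lazily so this file can be imported without the package installed.
  const { Builder, By } = require('selenium-webdriver');
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get(url);
    const elements = await driver.findElements(By.css('h1'));
    // Await here so the texts are read before the browser is closed.
    const texts = await Promise.all(elements.map((el) => el.getText()));
    return cleanTexts(texts);
  } finally {
    await driver.quit();
  }
}

// Usage:
//   scrapeHeadings('http://example.com').then(console.log);
```

The try/finally makes sure the browser is closed even when the page fails to load or a selector throws.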
Hope this will help. Thanks!