plugins – Need help creating asynchronous data scraper in WordPress
I need some help with a data scraping script I've been working on. I want to create a script that extracts content from a page.
I fetched the page via wp_remote_get() and wp_remote_retrieve_body(). Using the DOMDocument class, I was able to target the specific elements and insert them into the database afterwards. Please take a look at the following code:
$url = "https://www.myexampleurl.com/page-to-be-scraped";
$data = wp_remote_get(
    $url,
    array(
        'timeout' => 60,
    )
);
$body = wp_remote_retrieve_body( $data );

$dom = new DOMDocument();
$dom->loadHTML( $body );

$xpath = new DOMXPath( $dom );
$xpath->registerNamespace( 'm', $url );

// The page has a center tag, believe it or not
$center = $dom->getElementsByTagName( 'center' )->item( 0 );

// Targeting all links inside the center tag; the leading dot makes the
// query relative to $center ('//a' would match links in the whole document)
$query   = './/a';
$entries = $xpath->query( $query, $center );
$count   = 1;

foreach ( $entries as $entry ) {
    // The target elements have a 'data-lightbox' attribute
    $attr = $entry->attributes->getNamedItem( 'data-lightbox' );

    // Uploading the value of the attribute that precedes 'data-lightbox'
    if ( ! empty( $attr ) ) {
        // The fetched data is uploaded to the database using this function
        my_upload_file_by_url( $attr->previousSibling->nodeValue );
    }
}
Now, what I need to know is how to make this request asynchronous. I tried AJAX, but the request times out and throws an error.
Also, $dom->loadHTML( $body ) emits the following warning:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 241 in /Users/apple/Sites/indidev/wp-content/plugins/crawler/crawler.php on line 56
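For reference, here is a minimal standalone sketch of one commonly suggested way to keep loadHTML() from emitting such warnings: collecting libxml's parse errors internally. It assumes the warning comes from an unescaped "&" in the page markup, which I have not confirmed. The sample HTML is made up and merely stands in for the real page.

```php
<?php
// Hypothetical markup standing in for the scraped page. The bare "&" in
// the href is the kind of input that can trigger htmlParseEntityRef warnings.
$body = '<html><body><center><a href="/img?a=1&b=2" data-lightbox="set">x</a></center></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadHTML( $body );

// The collected errors can be inspected or logged before being cleared.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors( $previous );

// The document is still parsed and can be queried as usual.
$links = $dom->getElementsByTagName( 'a' );
```

This only hides the parser's complaints; the malformed markup is still recovered on a best-effort basis, so the extracted values are worth spot-checking.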
I also tried the wp_schedule_single_event() function, but it doesn't work as expected either. Any pointers are appreciated!
PS – There are many pages that need to be scraped and simultaneously inserted in the database.
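To make the many-pages requirement concrete, here is a rough sketch of one possible shape for it: keep the URLs in a queue and let WP-Cron work through them in small batches, so no single request has to run long enough to time out. Every name in it is hypothetical (the 'my_scraper_queue' option, the 'my_scraper_run_batch' hook, and my_scrape_page(), which is assumed to wrap the fetch-and-parse code above). The batch size and the 60-second delay are arbitrary.

```php
<?php
// Hypothetical names throughout: the 'my_scraper_queue' option, the
// 'my_scraper_run_batch' hook and my_scrape_page() are illustrative only.

// Take up to $size URLs off the front of the queue; returns array( batch, rest ).
function my_scraper_take_batch( array $queue, $size = 5 ) {
    $batch = array_slice( $queue, 0, $size );
    $rest  = array_slice( $queue, count( $batch ) );
    return array( $batch, $rest );
}

// Cron callback: scrape one batch, store the remainder, re-schedule if needed.
function my_scraper_run_batch() {
    $queue = get_option( 'my_scraper_queue', array() );
    list( $batch, $rest ) = my_scraper_take_batch( $queue );

    foreach ( $batch as $url ) {
        my_scrape_page( $url ); // Assumed wrapper around the fetch-and-parse code.
    }

    update_option( 'my_scraper_queue', $rest );

    // Queue another run only while URLs remain.
    if ( ! empty( $rest ) ) {
        wp_schedule_single_event( time() + 60, 'my_scraper_run_batch' );
    }
}

// Register the hook only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( 'my_scraper_run_batch', 'my_scraper_run_batch' );
}
```

Under these assumptions, a run would be started by saving the list of URLs with update_option( 'my_scraper_queue', $urls ) and scheduling the first event with wp_schedule_single_event( time(), 'my_scraper_run_batch' ). WP-Cron only fires on site visits unless a real cron job calls wp-cron.php, so the batches are not guaranteed to run on time.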