
Proxycurl

Proxycurl is a distributed crawling service that circumvents most (if not all) rate-limiting techniques employed by complex websites.

Introduction to Proxycurl's API

Rate limit

Every user has their own credentials, with a rate limit dictated by their active plan. The rate limit uses a sliding window of 60 seconds.

Authentication

Proxycurl uses a username and password pair for authentication; provide the values with each request via HTTP Basic Auth.

HTTP Basic Auth with python requests library:

import requests
from requests.auth import HTTPBasicAuth

API_HOSTNAME = 'https://replace.me.with.real.hostname.com'
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('user', 'pass'))

Making a proxycurl request

When you make a proxycurl request, a random node on our network opens a page at the specified URL and returns the HTML content of that page. Proxycurl requests are synchronous by default: the call blocks until the API returns the result.
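As a minimal sketch of that synchronous flow (assuming the placeholder hostname and the `user`/`pass` credentials used throughout these examples; the `build_payload` and `crawl` helpers are not part of the API, just illustration):

```python
import json

import requests
from requests.auth import HTTPBasicAuth

API_HOSTNAME = 'https://replace.me.with.real.hostname.com'  # placeholder hostname


def build_payload(url, crawl_type='xhr', request_id='unique-id', headers=None):
    """Assemble the JSON body for a proxycurl request."""
    payload = {'id': request_id, 'url': url, 'type': crawl_type}
    if headers is not None:
        payload['headers'] = headers
    return payload


def crawl(url, crawl_type='xhr', user='user', password='pass'):
    """Synchronous proxycurl request: blocks until the API returns the result."""
    r = requests.post(API_HOSTNAME,
                      auth=HTTPBasicAuth(user, password),
                      data=json.dumps(build_payload(url, crawl_type)))
    return r.json()  # {'curl_id': ..., 'data': ..., 'status_code': ...}
```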

Fast Crawl

A fast crawl is like a curl: it makes a request to the URL and returns the response immediately, without processing it. This is the fastest method of crawling websites.

HTTP Request

POST https://replace.me.with.real.hostname.com

import requests
import json
from requests.auth import HTTPBasicAuth

API_HOSTNAME = 'https://replace.me.with.real.hostname.com'
payload = {
    'id': 'unique-id',
    'url': 'https://api.ipify.org',
    'type': 'xhr',
    'headers': {'LANG': 'en'},
    'method': 'get'
}
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('user', 'pass'), data=json.dumps(payload))

Query Parameters

| Parameter | Required | Default | Possible values | Description |
|-----------|----------|---------|-----------------|-------------|
| id | true | | | A random, non-colliding string to identify this request. `id` does nothing for synchronous proxycurl requests. |
| url | true | | | The URL to crawl. |
| type | true | xhr | xhr | Set `type` to `xhr` for a fast crawl request. |
| headers | false | null | | A dictionary that overrides the default request headers. For example, `{'LANG': 'en'}` will replace the `LANG` header with a value of `en`. |
| method | false | get | get, post, delete, put | The HTTP method for the node to make the request with. |
| data | false | null | | Content to be placed into the body of the request. |
| dataType | false | html | xml, json, script, html | |
| contentType | false | application/x-www-form-urlencoded; charset=UTF-8 | | When sending data to the server, use this content type. |
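For instance, a fast crawl that POSTs form-encoded data through the node might combine `method`, `data`, `contentType`, and `dataType` like this (the target URL and `id` here are hypothetical):

```python
# Hypothetical fast-crawl payload that POSTs form-encoded data
# and asks the node to treat the response body as JSON.
payload = {
    'id': 'search-req-1',                        # hypothetical request id
    'url': 'https://replace.me.example/search',  # hypothetical target URL
    'type': 'xhr',
    'method': 'post',
    'data': 'q=proxycurl&lang=en',
    'contentType': 'application/x-www-form-urlencoded; charset=UTF-8',
    'dataType': 'json',
}
```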

Response Parameters

Response messages are JSON messages with the following parameters.

| Parameter | Description |
|-----------|-------------|
| curl_id | The id that was provided in the request. |
| data | The HTML of the crawled page. |
| status_code | Status code of the response, for example: 200. |
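A response body can be unpacked with a small helper (a sketch; the sample values below are illustrative, not real output):

```python
import json


def parse_response(raw):
    """Unpack a proxycurl response body into (curl_id, data, status_code)."""
    resp = json.loads(raw)
    return resp['curl_id'], resp['data'], resp['status_code']


# Illustrative response body
raw = '{"curl_id": "unique-id", "data": "<html>...</html>", "status_code": 200}'
curl_id, html, status = parse_response(raw)
```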

Browser Crawl

A browser crawl loads a URL just like a normal browser would, with JavaScript executed and all assets downloaded, and then returns the HTML content of the page. Browser crawls work just like headless browser page loads, except that the browsers come loaded with cookies of real users and unique residential IP addresses.

HTTP Request

POST https://replace.me.with.real.hostname.com

import requests
import json
from requests.auth import HTTPBasicAuth

API_HOSTNAME = 'https://replace.me.with.real.hostname.com'
payload = {
    'id': 'unique-id',
    'url': 'https://api.ipify.org',
    'type': 'browser',
    'headers': {'LANG': 'en'},
}
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('user', 'pass'), data=json.dumps(payload))

Query Parameters

| Parameter | Required | Default | Possible values | Description |
|-----------|----------|---------|-----------------|-------------|
| id | true | | | A random, non-colliding string to identify this request. `id` does nothing for synchronous proxycurl requests. |
| url | true | | | The URL to crawl. |
| type | true | | browser | Set `type` to `browser` for a browser crawl request. |
| headers | false | null | | A dictionary that overrides the default request headers. For example, `{'LANG': 'en'}` will replace the `LANG` header with a value of `en`. |
| timeout | false | 30 | | Value is in seconds. Time to wait for the node to respond. Nodes may die for various reasons. |
| dom_read_delay_ms | false | 500 | | Value is in milliseconds. Time to wait after a page has loaded (with JavaScript rendered) before grabbing the page's content and returning it. This is useful when the page is a JavaScript app, where "page loaded" does not mean the DOM has finished rendering. A page is identified as ready using jQuery's `document.ready`. |
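For a JavaScript-heavy page, a browser crawl payload might raise both timing knobs; the values and URL below are illustrative, not recommendations:

```python
# Hypothetical browser-crawl payload for a slow single-page app.
payload = {
    'id': 'spa-req-1',                        # hypothetical request id
    'url': 'https://replace.me.example/app',  # hypothetical single-page app
    'type': 'browser',
    'timeout': 60,              # wait up to 60 s for the node to respond
    'dom_read_delay_ms': 2000,  # give the DOM 2 s to settle after page load
}
```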

Response Parameters

Response messages are JSON messages with the following parameters.

| Parameter | Description |
|-----------|-------------|
| curl_id | The id that was provided in the request. |
| data | The HTML of the crawled page. |
| status_code | Status code of the response, for example: 200. |

Asynchronous Browser Crawl

A browser crawl can take a long time to complete because website assets can take a long time to download. In such cases, an asynchronous browser crawl request is helpful.

Asynchronous browser crawls require you to set up a listening HTTP server to receive responses as they stream in after requests are made. Responses are not guaranteed to return in order, so you should track requests by id and pair them with their responses.
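One way to do that pairing, sketched with only the standard library (the handler class, the `pending` map, and the seeded values are hypothetical, not part of the API):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Requests awaiting a webhook callback, keyed by the id sent with each request.
pending = {'unique-id': 'https://api.ipify.org'}


def pair_response(pending, resp):
    """Match an incoming webhook body to its pending request by curl_id."""
    return pending.pop(resp['curl_id'], None)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        resp = json.loads(self.rfile.read(length))
        url = pair_response(pending, resp)
        if url is not None:
            pass  # process resp['data'] for the original url here
        self.send_response(200)
        self.end_headers()


# To listen: HTTPServer(('', 8000), WebhookHandler).serve_forever()
```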

HTTP Request

POST https://replace.me.with.real.hostname.com

import requests
import json
from requests.auth import HTTPBasicAuth

API_HOSTNAME = 'https://replace.me.with.real.hostname.com'
payload = {
    'id': 'unique-id',
    'url': 'https://api.ipify.org',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'webhook': 'https://webhook-to-respond-to.com/new-resp',
}
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('user', 'pass'), data=json.dumps(payload))

Query Parameters

| Parameter | Required | Default | Possible values | Description |
|-----------|----------|---------|-----------------|-------------|
| id | true | | | A random, non-colliding string to identify this request. |
| url | true | | | The URL to crawl. |
| type | true | | browser | Set `type` to `browser` for a browser crawl request. |
| headers | false | null | | A dictionary that overrides the default request headers. For example, `{'LANG': 'en'}` will replace the `LANG` header with a value of `en`. |
| dom_read_delay_ms | false | 500 | | Value is in milliseconds. Time to wait after a page has loaded (with JavaScript rendered) before grabbing the page's content and returning it. This is useful when the page is a JavaScript app, where "page loaded" does not mean the DOM has finished rendering. A page is identified as ready using jQuery's `document.ready`. |
| timeout | false | 30 | | Value is in seconds. Time to wait for the node to respond. Nodes may die for various reasons. |
| webhook | true | | | A URL for the node to contact when the request has completed. |

Response Parameters

Response messages are JSON messages with the following parameters.

| Parameter | Description |
|-----------|-------------|
| curl_id | The id that was provided in the request. |
| data | The HTML of the crawled page. |
| status_code | Status code of the response, for example: 200. |

Status codes

The Proxycurl API uses the following status codes:

| Status Code | Meaning |
|-------------|---------|
| 200 OK | Request has been completed successfully and the body contains the result. |
| 202 Accepted | Request has been accepted by a node and the result will be returned through the provided webhook. |
| 401 Unauthorized | Wrong or missing username/password. |
| 403 Forbidden | User is not allowed to perform this action. |
| 429 Too Many Requests | User has exceeded the request rate limit. Use the value in the `Retry-After` header to back off. |
| 502 Bad Gateway | The assigned node encountered an error while processing the request. |
| 503 Service Unavailable | All nodes are busy. |
| 504 Gateway Timeout | The node did not reply within the given timeout. |
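On a 429, the `Retry-After` header tells you how long to back off before retrying. A minimal sketch of honoring it (the helper name is hypothetical):

```python
def backoff_seconds(status_code, headers, default=1.0):
    """Seconds to wait before retrying; 0 means no backoff is needed."""
    if status_code == 429:
        try:
            return float(headers.get('Retry-After', default))
        except (TypeError, ValueError):
            return default  # unparsable header: fall back to the default
    return 0.0
```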