A step-by-step guide on how to collect and preprocess satellite imagery data for machine learning algorithms

This article is the first part of a two-part series exploring the uses of machine learning algorithms on satellite imagery. Here we focus on the data collection and preprocessing steps, while the second part will showcase the use of various machine learning algorithms.

This article will:

  • Give you a better understanding of satellite imagery
  • Show you what type of imagery can be collected from open data sources
  • Guide you through the necessary steps to obtain “machine learning ready” images

    We will use the detection and classification of agricultural fields as an example. Python and Jupyter notebooks are used for the preprocessing steps; the notebook for reproducing these steps is available on GitHub.
    This article assumes basic fundamentals in data science and Python.

    Computer vision is a fascinating branch of machine learning that has been developing quickly in recent years. With public access to many off-the-shelf algorithm libraries (TensorFlow, OpenCV, PyTorch, fastai…) as well as open source datasets (CIFAR-10, ImageNet, IMDB…), it is very easy for a data scientist to get hands-on experience on this topic.

    But did you know that you can also access open source satellite imagery? This type of imagery can be used in a wealth of use cases (agriculture, logistics, energy…). However, it presents additional challenges for a data scientist due to its size and complexity. I have written this article to share some of my key learnings from working with it.

    Step 1 — Select the satellite open data source

    First, you need to choose which satellite you want to use. Key features to watch out for are:

    • Spatial resolution: how many meters are covered per pixel. Higher resolution will unlock some use cases and give better performance for most algorithms.

    • Instruments available: the instruments capture different spectral bands, from both the visible and the invisible spectrum. A band's usefulness will vary depending on the use case.

    • Temporal resolution: the time between two visits over a given spot on Earth. Note that many satellite systems actually comprise several satellites, in which case the overall temporal resolution is usually that of a single satellite divided by the number of satellites.

    • Radiometric resolution: the number of possible values the instrument can capture. The higher it is, the more precise the measurements.
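    The constellation rule of thumb above can be written out explicitly. A minimal sketch, with hypothetical numbers (the function name and values are illustrative, not from any satellite's specification):

```python
def effective_revisit_days(single_satellite_revisit_days, n_satellites):
    # A constellation of n identical, evenly phased satellites revisits
    # a given spot roughly n times as often as a single satellite would
    return single_satellite_revisit_days / n_satellites

# e.g. a hypothetical 10-day single-satellite revisit with 2 satellites
print(effective_revisit_days(10, 2))  # 5.0
```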

    Here are satellites whose data are available free of charge:

    Benchmark of satellite systems

    Since our goal is to detect and identify agricultural crops, the visible spectrum (used to produce standard color images) should be quite useful. Additionally, there are correlations between the nitrogen level of crops and NIR (near-infrared) spectral bands. As far as free options go, Sentinel-2 is currently the best in terms of resolution, with 10 m for those spectral bands, so we are going to use this satellite.

    Of course, 10 m resolution is much lower than the state of the art in satellite imagery, with commercial satellites reaching 30 cm resolution. Access to commercial satellite data can unlock many use cases but, despite increasing competition as new players enter this market, it remains highly priced and hence difficult to integrate in a scaled use case.

    Step 2 — Collect the satellite imagery data from Sentinel 2

    Now that we have selected our satellite source, the next step is to download the images. Sentinel-2 is part of the Copernicus Programme of the European Space Agency. We can access its data using the SentinelSat Python API.

    We first need to create an account on the Copernicus Open Access Hub. Note that this website also provides a way to query images through a GUI.

    One way to specify the zones you want to explore is to provide the API with a GeoJSON file, a JSON file containing the GPS coordinates (latitude and longitude) of a zone. GeoJSON files can be created quickly online.

    This way, you can query all the satellite images intersecting the provided zone.
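    For illustration, here is what a minimal GeoJSON polygon looks like, built from Python. The coordinates below are made-up (a small rectangle); note that GeoJSON uses (longitude, latitude) order and the ring must be closed, i.e. the first point is repeated at the end:

```python
import json

# Hypothetical zone: a small rectangle as a closed ring of
# (longitude, latitude) pairs
zone = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [2.30, 48.80], [2.45, 48.80],
                [2.45, 48.92], [2.30, 48.92],
                [2.30, 48.80],
            ]],
        },
    }],
}

# Serialize; writing this string to a .geojson file gives the input
# expected by read_geojson
geojson_string = json.dumps(zone)
```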

    Here we list all images available for a given zone and time range:

    from datetime import date
    from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt

    # Credentials of your Copernicus Open Access Hub account
    api = SentinelAPI(user, password)

    shape = geojson_to_wkt(read_geojson(geojson_path))

    images = api.query(
        shape,
        date=(date(2020, 5, 1), date(2020, 5, 10)),
        processinglevel="Level-2A",
        cloudcoverpercentage=(0, 30),
    )

    images_df = api.to_dataframe(images)

    Note the arguments used:

    • date: this range should be set keeping in mind the 6-day temporal resolution. Since we are going to detect crops, we set it to a pre-harvest period.

    • processinglevel: Level-2A is the most advanced processing level; for instance, atmospheric corrections have already been performed on the images.

    • cloudcoverpercentage: clouds make some images difficult to exploit, so we filter out images with more than 30% cloud cover as they are likely unusable.
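    As a sketch of how the resulting dataframe can be used to pick the clearest product (the dummy values below stand in for a real query result; the real dataframe returned by to_dataframe is indexed by product uuid and contains a cloudcoverpercentage column):

```python
import pandas as pd

# Dummy stand-in for api.to_dataframe(images)
images_df = pd.DataFrame(
    {"cloudcoverpercentage": [12.3, 2.1, 28.9]},
    index=["uuid-a", "uuid-b", "uuid-c"],
)

# uuid of the image with the lowest cloud cover
best_uuid = images_df["cloudcoverpercentage"].idxmin()
```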


    images_df is a pandas DataFrame containing all images that match our query. We select one with low cloud cover and download it using the API:

    uuid = "01d97f6b-8eb5-4edc-9261-10e3f4d437d0"
    api.download(uuid)

    Step 3 — Understand which spectral bands to use and manage quantization

    Once the zip file is downloaded, we can see that it actually contains several images.

    This is because, as mentioned in step 1, Sentinel-2 captures many spectral bands, not only Red, Green and Blue, and each band is stored as a separate image.

    Also, its instruments do not all have the same spatial resolution, so if we check the GRANULE/<image_id>/IMG_DATA folder we see a subfolder for each resolution: 10, 20 and 60 meters.

    We are going to use the bands with the maximum 10 m resolution, which include Bands 2, 3, 4 and 8, i.e. Blue, Green, Red and NIR. Bands 5, 6 and 7, the vegetation red edge bands (spectral bands between red and NIR that mark a transition in reflectance value for vegetation), could also be interesting for detecting crops, but since they have a lower resolution we leave them aside for now.

    Here we load each band as a 2D numpy matrix:

    import os
    import rasterio

    def get_band(image_folder, band, resolution=10):
        subfolder = [f for f in os.listdir(image_folder + "/GRANULE") if f[0] == "L"][0]
        image_folder_path = f"{image_folder}/GRANULE/{subfolder}/IMG_DATA/R{resolution}m"
        image_files = [im for im in os.listdir(image_folder_path) if im[-4:] == ".jp2"]
        selected_file = [im for im in image_files if im.split("_")[2] == band][0]

        with rasterio.open(f"{image_folder_path}/{selected_file}", driver="JP2OpenJPEG") as infile:
            img = infile.read(1)

        return img

    band_dict = {}

    for band in ["B02", "B03", "B04", "B08"]:
        band_dict[band] = get_band(image_folder, band, 10)

    A quick test to make sure everything is working is to recreate the true color image (i.e. RGB) and display it using OpenCV and matplotlib:

    import cv2
    import matplotlib.pyplot as plt

    img = cv2.merge((band_dict["B04"], band_dict["B03"], band_dict["B02"]))
    plt.imshow(img)

    Satellite Image — Color intensity issue (Copernicus Sentinel data 2020)

    However, there is an issue with color intensity. This is because the quantization (number of possible values) differs from what matplotlib expects, namely either 0–255 int values or 0–1 float values. Standard images often use 8-bit quantization (256 possible values), but Sentinel-2 images use 12-bit quantization, and the processing converts the values to 16-bit integers (65,536 values).

    To scale a 16-bit integer down to an 8-bit integer we should in theory divide by 256, but this actually results in a very dark image, since the whole range of possible values is not used. The max value in our image is actually 16752 out of 65536, and few pixels reach values higher than 4000, so dividing by 8 gives an image with decent contrast.

    img_processed = img / 8
    img_processed = img_processed.astype(int)

    Satellite Image — Right color intensity (Copernicus Sentinel data 2020)

    We now have a 3-dimensional tensor of shape (10980, 10980, 3) of 8-bit integers, in the form of a NumPy array.

    Note that:

    • We could go further and use a non-linear scaling approach to improve contrast.

    • Casting to integer amounts to rounding the values, so we actually lose some information. This is fine for many use cases and improves processing speed, but if precision is key it might be better to convert the reflectance values to 0–1 floats (a format also handled by most ML libraries) rather than 0–255 integers.

    • For now we leave the NIR spectral band aside. This will help us test faster, since many off-the-shelf algorithms expect images with 3 color channels. But it could be interesting to reintegrate it at some point, either by replacing one of the other channels or by customizing algorithms to accept 4 color channels.
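    The float alternative mentioned above can be sketched as follows. The function name is ours, and the default ceiling of 4000 comes from our empirical observation of this particular image, not from any official constant:

```python
import numpy as np

def to_float_image(bands, ceiling=4000):
    # Stack single-band 2D arrays into an H x W x C image and scale the
    # raw 16-bit reflectance values to [0, 1], clipping above `ceiling`
    img = np.dstack(bands).astype(np.float32)
    return np.clip(img / ceiling, 0.0, 1.0)

# Usage with the bands loaded earlier would look like:
# rgb = to_float_image([band_dict["B04"], band_dict["B03"], band_dict["B02"]])
```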

    Step 4 — Split the images to machine learning ready sizes

    We could try applying an algorithm to our current image, but in practice that would not work with most techniques due to its size. Even with strong computing power, most algorithms (and especially deep learning) would be off the table.

    So we must split the image into fragments. One question is how to choose the fragment size:

    • Smaller fragments are easier to process and tend to give better performance with most algorithms.

    • Larger fragments reduce the number of objects that become hard to detect because they are split across several images. However, fragments that are too large can be an issue for some ML algorithms that rescale images, or that do not scale well past a certain number of objects in a single image.

    We decide to split each image into a 45 × 45 grid (45 being, conveniently, a divisor of 10980).

    Once this parameter is fixed, the step is straightforward using NumPy array manipulation. We store our fragments in a dict using (x, y) tuples as keys.

    import itertools

    frag_count = 45
    frag_size = int(img_processed.shape[0] / frag_count)
    frag_dict = {}

    for y, x in itertools.product(range(frag_count), range(frag_count)):
        frag_dict[(x, y)] = img_processed[y*frag_size: (y+1)*frag_size,
                                          x*frag_size: (x+1)*frag_size, :]

    plt.imshow(frag_dict[(10, 10)])
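    A quick sanity check of the splitting logic: on a small dummy array (90 × 90, which is divisible by 45 just like 10980), the fragments should stitch back into the original image exactly:

```python
import itertools
import numpy as np

# Dummy stand-in for the real (10980, 10980, 3) image
img_processed = np.arange(90 * 90 * 3).reshape(90, 90, 3)
frag_count = 45
frag_size = img_processed.shape[0] // frag_count

frag_dict = {}
for y, x in itertools.product(range(frag_count), range(frag_count)):
    frag_dict[(x, y)] = img_processed[y*frag_size: (y+1)*frag_size,
                                      x*frag_size: (x+1)*frag_size, :]

# Stitch the fragments back together row by row and compare
rows = [np.hstack([frag_dict[(x, y)] for x in range(frag_count)])
        for y in range(frag_count)]
reassembled = np.vstack(rows)
```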

    Fragment of the satellite image (Copernicus Sentinel data 2020)

    Step 5 — Link your data to the satellite images using GPS to UTM conversion

    Finally, we need to link our images with the data we have on agricultural fields to enable a supervised machine learning approach. In our case, the field data comes as a list of GPS coordinates (latitude, longitude), similar to the GeoJSON we used in step 2, and we want to find their pixel positions. For the sake of clarity, I will first showcase converting a pixel position to GPS coordinates, then show the reverse.

    The metadata collected when requesting the SentinelSat API provides the GPS coordinates of the corners of the image (the footprint column), but we want the coordinates of a specific pixel. Latitude and longitude are angular values and do not evolve linearly within an image, so we cannot use a simple ratio of the pixel position. There are ways to solve this with mathematical equations, but off-the-shelf solutions already exist in Python.

    One quick solution is to convert the pixel position to UTM (Universal Transverse Mercator) coordinates, in which a position is defined by a zone and an (x, y) position (in meters) that does evolve linearly with the image. We can then use the UTM to latitude/longitude conversion provided by the utm library. To do this, we first need the UTM zone and the position of the upper-left corner of the image, which can be found in the true color image's metadata.

    import rasterio
    import utm

    # Obtaining metadata
    transform = rasterio.open(tci_file_path, driver="JP2OpenJPEG").transform
    zone_number = int(tci_file_path.split("/")[-1][1:3])
    zone_letter = tci_file_path.split("/")[-1][0]
    utm_x, utm_y = transform[2], transform[5]

    # Converting pixel position to UTM
    east = utm_x + pixel_column * 10
    north = utm_y + pixel_row * -10

    # Converting UTM to latitude and longitude
    latitude, longitude = utm.to_latlon(east, north, zone_number, zone_letter)

    The GPS position to pixel conversion uses the reverse formulas. In that case, we have to make sure the UTM zone obtained matches our image's, which can be done by forcing the zone number. If our object is not present in the image, this will result in an out-of-bounds pixel value.

    # Converting latitude and longitude to UTM
    east, north, zone_number, zone_letter = utm.from_latlon(
        latitude, longitude, force_zone_number=zone_number
    )

    # Converting UTM to column and row
    pixel_column = round((east - utm_x) / 10)
    pixel_row = round((north - utm_y) / -10)
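    The pixel-to-UTM arithmetic and its inverse should round-trip exactly. A minimal sketch with hypothetical corner coordinates (the UTM values below are made-up; the zone handling via the utm library is left out on purpose):

```python
# Hypothetical upper-left corner of the image, in UTM meters
utm_x, utm_y = 399960.0, 5400000.0
pixel_column, pixel_row = 1234, 567

# Pixel -> UTM: 10 m per pixel; northing decreases as the row increases
east = utm_x + pixel_column * 10
north = utm_y + pixel_row * -10

# UTM -> pixel recovers the original position
recovered_column = round((east - utm_x) / 10)
recovered_row = round((north - utm_y) / -10)
```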


    We are now ready to use the satellite images for machine learning!

    In the next article of this two-part series, we will see how to use these images to detect and classify field surfaces, using both supervised and unsupervised machine learning.

    Thanks for reading, and don't hesitate to follow the Artefact tech blog if you wish to be notified when the next article releases!

    You can find more about us and our projects on our Medium blog

    Artefact Newsletter

    Interested in Data Consulting | Data & Digital Marketing | Digital Commerce?
    Read our monthly newsletter to get actionable advice, insights and business cases from our data experts around the world!