Joining a shapefile attribute table to a CSV file
Joining attribute tables to other database tables allows you to use a spatial data set to reference a dataset without any geometry, using a common key between the data tables. A very common use case for this is to join a vector dataset of census attributes to a more detailed census attribute dataset. That is the use case we will demonstrate here by linking a US census track file to a detailed Comma Separated Value (CSV) file containing more in-depth information.
Getting ready
For this recipe you will need a census tract shapefile and a CSV file containing the appropriate census data for the shapefile. You can download the sample dataset from the following URL:
https://github.com/GeospatialPython/Learn/raw/master/census.zip
Extract the data to a directory named /qgis_data/census
.
How to do it...
The join operation is quite involved. We'll perform this operation and save the layer as a new shapefile with the joined attributes. Then, we'll load the new layer and compare the field count to the original layer to ensure the join occurred:
- Start QGIS.
- From the Plugins menu, select Python Console.
- First, we'll load the county census track layer and validate it:
vectorLyr= QgsVectorLayer("/qgis_data/census/hancock_tracts.shp", "Hancock" , "ogr") vectorLyr.isValid()
- Now, we'll load the CSV file as a layer and validate it as well:
infoLyr = QgsVectorLayer("/qgis_data/census/ ACS_12_5YR_S1901_with_ann.csv", "Census" , "ogr") infoLyr.isValid()
- Now, we must add both layers to the map registry in order for the two layers to interact for the join. However, we will set the visibility to
False
so that the layers do not appear on the map:QgsMapLayerRegistry.instance().addMapLayers([vectorLyr,infoLyr], False)
- Next, we must create a special join object:
info = QgsVectorJoinInfo()
- The join object needs the layer ID of the CSV file:
info.joinLayerId = infoLyr.id()
- Next, we specify the key field from the CSV file whose values correspond to the values in the shapefile:
info.joinFieldName = "GEOid2"
- Then, we specify the corresponding field in the shapefile:
info.targetFieldName = "GEOID"
- We then set the
memoryCache
property toTrue
to speed up access to the joined data:info.memoryCache = True
- Now, we add the join to the layer:
vectorLyr.addJoin(info)
- Next, we will write out the joined shapefile to a new file on disk:
QgsVectorFileWriter.writeAsVectorFormat(vectorLyr, "/qgis_data/census/joined.shp", "CP120", None, "ESRI Shapefile")
- Now, we'll load the new shapefile back in as a layer for verification:
joinedLyr= QgsVectorLayer("/qgis_data/census/joined.shp","Joined" , "ogr")
- Next, verify that the field count in the original layer is 12:
vectorLyr.dataProvider().fields().count()
- Finally, verify that the new layer has a field count of 142 from the join:
joinedLyr.dataProvider().fields().count()
How it works...
This recipe reaches out to the very edge of the PyQGIS API, forcing us to use some workarounds. Most recipes for data manipulation can be performed programmatically without writing data to disk or loading layers onto the map. But joins are different. Because the QgsVectorJoinInfo
object needs the layer ID of the CSV layer we must add both layers to the map layer registry. Fortunately, we can do this without making them visible if we are just trying to write a data manipulation script.
There's more...
You can find an alternate method in a Processing Toolbox script which manually matches up the joined data in Python in the QGIS Join attributes table function. This script creates a whole new shapefile with the joined tables.