In 2018 we were working on a research project and needed to know whether a given first name was male or female. After hours of Googling for 'baby name lists', 'name databases' and 'name datasets', we discovered that there was no complete database of first names and gender covering all countries. Most of the name databases we found had a different layout per country, were incomplete or contained non-existent names. That is why we created Name Census, the most comprehensive name database in the world!
Looking back, it took an enormous effort and a lot of patience to create this name database. To compile it we reached out to governments and statistical agencies and gathered open data from different sources. We received all kinds of files with different layouts, and the files we found online were incomplete and had character encoding issues. We restructured the files and imported everything into this standardized first name and surname database.
If you want to use our name database in an online service, research or scientific project, it is important to understand how it was created. In the paragraphs below we explain as clearly as possible how we worked and what the list of names and the results are based on.
A census is an enumeration of people, houses, firms, or other important items in a country or region at a particular time. That is why we called our service Name Census. To get all the first names and surnames we reached out to governments and statistical agencies and gathered open data from different sources. We used 22.055.118 publicly available social media profiles to cross-reference and count each first name and surname per country. This way we could be sure that the names in our name database are actually in use, and we could create our popularity metric.
European and North American countries are well organised and have governmental agencies or statistical bureaus that register the names and gender of newborn babies. In many countries these baby name databases are publicly available as open data. During the first year we reached out to many official institutions and requested their lists of names of newborn babies and their gender. Many European and North American countries delivered the data within a week. In addition to the governmental agencies and statistical bureaus there are also many open data initiatives like:
Unfortunately, not every country in the world was able to deliver lists of names of newborn babies and their gender. That turned out not to be necessary: we discovered that we didn't need the official first names and surnames from every country. Spanish, for example, is an official language in many countries outside Spain, such as Mexico, Colombia and Argentina, so we were able to use the official Spanish first name and surname database for the name census in all Spanish-speaking countries. We applied the same logic to the French, English and Portuguese name databases.
By using the official name databases of dozens of European countries and social media we were able to derive the name census for many other countries.
Validating on Social Media
Our name database was created by taking first names and surnames obtained from governments and cross-referencing them with millions of names from publicly available social media profiles. We received the official name database from 31 countries. We took all those names and used 139.388.346 publicly available social media profiles to cross-reference and count each name per country. This meant we needed to know at least three things from each social media profile. It needed to have a:
- First name
- Surname
- Location (e.g.: Paris, France)
We only wanted to use social media profiles with complete names, so that we could be confident a profile belonged to a real person. We used our name parsing software to split the complete name into components such as first name, middle names and surname.
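The splitting step can be illustrated with a minimal sketch. It is not the actual Name Census parser: `split_full_name` and `NameParts` are hypothetical names, and the code assumes whitespace-separated names in Western order (first name first, surname last).

```python
from typing import List, NamedTuple, Optional


class NameParts(NamedTuple):
    first_name: str
    middle_names: List[str]
    surname: str


def split_full_name(full_name: str) -> Optional[NameParts]:
    """Split a full name into first name, middle names and surname.

    Assumption: Western name order and whitespace separators.
    Returns None for incomplete names (fewer than two tokens).
    """
    tokens = full_name.split()
    if len(tokens) < 2:
        return None  # incomplete name, so the profile would be skipped
    return NameParts(tokens[0], tokens[1:-1], tokens[-1])
```

A real parser would also have to handle multi-word surnames, titles and other name orders, which this sketch ignores.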
Because we compiled a name database per country, we also needed to know which country a social media profile originated from. To do that we created a "city parser" that could take in a location string and match it to an official location and country code. For example, if a profile had the location string "The World" we could not map it to an actual location, so we didn't use the name. If a profile had the location string "The bay area" we knew it was somebody from the United States (US), and if it was "Berlin und Frankfurt" we knew it had to be Germany (DE).
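The idea behind such a location matcher can be sketched as follows. This is an illustration under stated assumptions, not the real "city parser": `KNOWN_PLACES` is a tiny hand-made lookup table and `country_code_for` is a hypothetical function name.

```python
import re
from typing import Optional

# Assumption: a tiny illustrative gazetteer. A real one would hold
# many thousands of official place names and aliases.
KNOWN_PLACES = {
    "paris": "FR",
    "france": "FR",
    "berlin": "DE",
    "frankfurt": "DE",
    "the bay area": "US",
}


def country_code_for(location: str) -> Optional[str]:
    """Map a free-text location to a country code, or None if unusable."""
    text = location.casefold().strip()
    # First try the whole string, e.g. "The bay area".
    if text in KNOWN_PLACES:
        return KNOWN_PLACES[text]
    # Otherwise look at the individual words, e.g. "Berlin und Frankfurt".
    tokens = re.split(r"\W+", text)
    codes = {KNOWN_PLACES[t] for t in tokens if t in KNOWN_PLACES}
    # Accept only when every recognised part points to the same country.
    return codes.pop() if len(codes) == 1 else None
```

With this table, "Berlin und Frankfurt" resolves to DE because both recognised words agree, while "The World" contains no known place and is rejected.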
Eventually only 22.055.118 of the 139.388.346 social media profiles had a complete name and valid location.
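That is roughly 16% of the profiles we started with. The filtering and counting can be sketched as below, assuming each profile has already been reduced to a `(first_name, surname, country_code)` tuple in which a missing part is `None`; `count_first_names` is a hypothetical helper, not our production code.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Profile = Tuple[Optional[str], Optional[str], Optional[str]]


def count_first_names(profiles: Iterable[Profile]) -> Counter:
    """Count first names per country, skipping incomplete profiles."""
    counts: Counter = Counter()
    for first_name, surname, country in profiles:
        # Keep only profiles with a complete name and a valid location.
        if not (first_name and surname and country):
            continue
        counts[(country, first_name)] += 1
    return counts
```

The resulting counts per `(country, first name)` pair are the kind of raw numbers a popularity metric could be built on.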
Name database in CSV, SQL and JSON format
Our name database is available in CSV, SQL and JSON format. These file types are widely used for exchanging data between applications. The files are encoded in UTF-8. Each row contains a name and a few additional columns, such as gender and popularity. You can download the Name Census top 100 from GitHub or Kaggle to get a preview of the format.
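As a rough sketch of how the CSV variant could be read in Python: the column names `name`, `gender` and `popularity`, the sample rows and the file name in the comment are assumptions for illustration only, so check the downloaded preview for the real header.

```python
import csv
import io

# Made-up sample rows; the real column names and values may differ.
SAMPLE_CSV = "name,gender,popularity\nZoë,F,87\nJosé,M,93\n"


def load_names(file_obj):
    """Read name rows from an open CSV file into a list of dicts."""
    reader = csv.DictReader(file_obj)
    return [
        {
            "name": row["name"],
            "gender": row["gender"],
            "popularity": int(row["popularity"]),
        }
        for row in reader
    ]


rows = load_names(io.StringIO(SAMPLE_CSV))

# For a real file, open it as UTF-8 (hypothetical file name):
# with open("first_names.csv", encoding="utf-8", newline="") as f:
#     rows = load_names(f)
```

Opening the file explicitly as UTF-8 keeps accented names such as "Zoë" and "José" intact on every platform.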
The following table shows an example of the first name database columns.
The following table shows an example of the surname database columns.
With 1.507.690 validated first names and 3.251.185 validated surnames from 139 countries, Name Census is the world's most comprehensive name database available for download. Because the name database is available in CSV, SQL and JSON format, it is easy to import into any database. To give an idea of what you can do with the name list, we have listed a few typical use cases.
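Before turning to the use cases, here is a minimal sketch of such an import using SQLite from Python's standard library. The table name `first_names`, its columns and the sample rows are assumptions for illustration, not the layout of the delivered SQL file.

```python
import sqlite3


def import_first_names(conn, rows):
    """Create a simple table (assumed layout) and insert name rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS first_names ("
        "name TEXT NOT NULL, gender TEXT, popularity INTEGER)"
    )
    conn.executemany(
        "INSERT INTO first_names (name, gender, popularity) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


# In-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
import_first_names(conn, [("Zoë", "F", 87), ("José", "M", 93)])
```

The same pattern works with other databases; only the connection and the placeholder syntax change.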
Auto complete forms
- Shorten contact, registration and order forms
- Get more conversions (people don't like to fill in forms)
- Get more information out of your form (gender and nationality)
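A name list can drive autocomplete with a simple prefix search over a sorted list. The sketch below is one possible approach, not a Name Census component; `NameAutocomplete` and its methods are hypothetical names.

```python
import bisect


class NameAutocomplete:
    """Case-insensitive prefix lookup over a sorted list of names."""

    def __init__(self, names):
        # Sort by case-folded key, keeping the original spelling alongside.
        self._entries = sorted((n.casefold(), n) for n in set(names))
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, limit=5):
        """Return up to `limit` names that start with `prefix`."""
        key = prefix.casefold()
        if not key:
            return []
        # Binary search for the first entry that is >= the prefix.
        start = bisect.bisect_left(self._keys, key)
        results = []
        for entry_key, original in self._entries[start:start + limit]:
            if not entry_key.startswith(key):
                break
            results.append(original)
        return results
```

Because the list is sorted once up front, each lookup is a binary search plus a short scan, which stays fast even with millions of names.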
Create a local name parser
Do you have a website where people can sign up, order or submit a question via a contact form? You can use our name parsing software to check whether a name exists and is not made up or misspelled. If you don't want to depend on our API, you can always buy our name list and create a local name parser.
- Split up the first and last name of your customers
- Make sure names are not misspelled
- Get more information from your customers (gender and nationality)
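A local check against the name list could look like the sketch below. It uses `difflib` from the Python standard library to suggest close matches for possible misspellings; `check_name` and the 0.8 similarity cutoff are illustrative assumptions, not how our own parser works.

```python
import difflib


def check_name(name, known_names):
    """Return (is_known, suggestions) for a submitted first name."""
    # Case-insensitive lookup that remembers the original spelling.
    lookup = {n.casefold(): n for n in known_names}
    key = name.strip().casefold()
    if key in lookup:
        return True, [lookup[key]]
    # Unknown name: offer similar known names as spelling suggestions.
    close = difflib.get_close_matches(key, list(lookup), n=3, cutoff=0.8)
    return False, [lookup[c] for c in close]
```

For example, with a list containing "Michael", the input "Micheal" is reported as unknown with "Michael" as the top suggestion, while a random string yields no suggestions at all.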
We keep improving our databases, software and website. This changelog gives a brief overview of the changes and the progress we have made.