In 2018, our research project hit a roadblock: we needed to determine gender based on names across different countries. Despite extensive searches for 'baby name lists', 'name databases', and 'name datasets', we discovered a significant gap in global name data. Existing databases were often country-specific, incomplete, or filled with inaccuracies. This led us to create Name Census, now the world's most comprehensive name database!

Building our first names and surnames database demanded sustained effort and patience. We reached out to governments and statistical agencies worldwide and gathered open data from diverse sources. We faced challenges such as inconsistent file formats, partial name lists, and character encoding issues in many online resources. With hard work and common sense, we developed a unified database of first names and surnames.

For those considering our name database for online services, research, or (commercial) projects, understanding its creation is important. In the following sections, we explain our methodology, data sources, and the principles behind our comprehensive name databases.

Methodology

A census is a comprehensive count of a population in a specific area at a given time. This concept inspired the name of our service, Name Census. To build our database of first names and surnames we collaborated with governments and statistical agencies. Additionally, we gathered open data from various resources and analyzed 22.055.118 publicly available social media profiles. This approach ensured the accuracy of our name database and allowed us to create a reliable popularity metric.

Governments and statistical agencies

European and North American countries have well-established systems for recording newborn names and genders. Many make this data publicly available for download. In our first year we contacted numerous governments and statistical agencies and requested lists of newborn names and their genders. We received data from many European and North American countries within a few weeks. We also leveraged open data initiatives.

While not every country provided comprehensive name data, we discovered an efficient and accurate workaround. For instance, Spanish is an official language in many countries like Mexico, Colombia, and Argentina. This allowed us to apply the Spanish name database across all Spanish-speaking countries. We used similar strategies for French, English, and Portuguese name databases.
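The language-based workaround can be pictured as a simple lookup. The sketch below is illustrative only: the mapping `LANGUAGE_FALLBACK` and the function `source_country_for` are hypothetical names, and the actual country assignments used for the database are not listed in this text.

```python
from typing import Optional, Set

# Hypothetical mapping: country code -> country whose official list is reused.
# Only the three examples named in the text are included here.
LANGUAGE_FALLBACK = {
    "MX": "ES",  # Mexico -> Spanish name database
    "CO": "ES",  # Colombia -> Spanish name database
    "AR": "ES",  # Argentina -> Spanish name database
}

def source_country_for(country_code: str, has_official_list: Set[str]) -> Optional[str]:
    """Return the country whose official list applies, or None if there is none."""
    code = country_code.upper()
    if code in has_official_list:
        return code  # the country supplied its own list
    return LANGUAGE_FALLBACK.get(code)  # otherwise fall back by shared language
```

For example, `source_country_for("mx", {"ES", "DE"})` returns `"ES"` under this toy mapping.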

By combining official European name databases with social media data, we successfully extrapolated name censuses for many other countries.

Validating through social media

To ensure accuracy, we cross-referenced government-sourced names with millions of publicly available social media profiles. We obtained official name databases from 31 countries and analyzed 139.388.346 social media profiles. Each profile needed to provide:

  • First name
  • Surname
  • Location (e.g., Paris, France)

To ensure data integrity, we only used profiles with complete names, confirming their authenticity. Our proprietary name parsing software broke down full names into components: first name, middle names, and surname. For country-specific databases, we developed a "city parser" to match location strings with official locations and country codes. For example, "The Bay Area" indicated a US profile, while "Berlin und Frankfurt" signified a German profile. Ultimately, 22.055.118 out of 139.388.346 social media profiles met our criteria of having a complete name and valid location. We used these social media profiles to calculate the frequency of the first name and the surname.
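The filtering step can be sketched roughly as follows. This is not the proprietary parser: `KNOWN_LOCATIONS`, `parse_country`, `split_name`, and `usable_profile` are hypothetical stand-ins, the name split is deliberately naive, and the location table holds only the examples mentioned in the text.

```python
from typing import Optional

# Toy lookup standing in for the "city parser"; entries are illustrative only.
KNOWN_LOCATIONS = {
    "paris, france": "FR",
    "the bay area": "US",
    "berlin und frankfurt": "DE",
}

def parse_country(location: str) -> Optional[str]:
    """Map a free-text location string to a country code, if recognized."""
    return KNOWN_LOCATIONS.get(location.strip().lower())

def split_name(full_name: str) -> Optional[dict]:
    """Naive split: first token, last token, everything between as middle names."""
    parts = full_name.split()
    if len(parts) < 2:
        return None  # incomplete name, so the profile is skipped
    return {"first": parts[0], "middle": parts[1:-1], "surname": parts[-1]}

def usable_profile(full_name: str, location: str) -> Optional[dict]:
    """Keep a profile only if it has a complete name and a valid location."""
    name = split_name(full_name)
    country = parse_country(location)
    if name is None or country is None:
        return None
    return {**name, "country": country}
```

A real parser would also need to handle titles, multi-word surnames, and name order conventions, which this sketch ignores.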

Name database in CSV, SQL and JSON format

Our name database is available in CSV, SQL, and JSON format. These formats allow easy integration with various applications and systems. All files are encoded using the industry-standard UTF-8 character encoding, guaranteeing compatibility and accurate representation of international names. To explore the structure and content of our database, you can download a sample of the Name Census top 100 from GitHub or Kaggle.
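As a rough illustration of working with the CSV format, the sketch below reads UTF-8 rows with Python's standard `csv` module. The header names (`name`, `ascii`, `country_code`, and so on) are assumptions and may differ from the real export; an inline sample stands in for a downloaded file.

```python
import csv
import io

# Inline sample standing in for a downloaded CSV file. The header names are
# assumptions and may not match the actual export.
SAMPLE = """name,ascii,country_code,official,gender,unisex,frequency,country_rank
Björn,Bjorn,DE,Y,M,N,464,837
Hélène,Helene,FR,Y,F,N,11,829
"""

def load_names(handle):
    """Read rows into dicts, converting the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["frequency"] = int(row["frequency"])
        row["country_rank"] = int(row["country_rank"])
        rows.append(row)
    return rows

names = load_names(io.StringIO(SAMPLE))
```

For a file on disk, the handle would be opened with `open(path, encoding="utf-8", newline="")` so that non-ASCII names such as "Björn" are decoded correctly.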

The following table shows an example of the first name database columns.

Name | ASCII | Country code | Official*1 | Gender*2 | Unisex | Frequency*3 | Country Rank*4
Сергей | Sergej | RU | Y | M | N | 4 | 4.573
Анна | Anna | RU | Y | F | N | 6 | 7.214
Björn | Bjorn | DE | Y | M | N | 464 | 837
Jürgen | Jurgen | DE | Y | M | N | 39 | 1.355
Hélène | Helene | FR | Y | F | N | 11 | 829

The following table shows an example of the surname database columns.

Name | ASCII | Country code | Official*1 | Gender*2 | Frequency*3 | Country Rank*4
Nuñez | nunez | ES | Y | | 1.903 | 45
Bhardwaj | bhardwaj | IN | Y | | 1.826 | 48
иванов | ivanov | RU | Y | M | 1.882 | 1
Jónsdóttir | jonsdottir | IS | N | | 63 | 2
Genç | genc | TR | Y | | 1.773 | 54

*1 Our name database integrates official name lists from different governments and statistical agencies worldwide. When a name appears on social media in a country and matches an official list, we mark it as "official." If we encounter a name on local social media that's official in another country, we include it but mark it as 'unofficial'. This methodology ensures comprehensive coverage while maintaining the highest standards of data integrity.
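One way to read the official/unofficial rule is the sketch below. It assumes that "official" means the name is on the official list of the same country where it was seen; the function `official_flag`, the toy `OFFICIAL_LISTS`, and the treatment of names found on no official list (returned as `None`) are all assumptions, not a description of the actual pipeline.

```python
from typing import Dict, Optional, Set

def official_flag(name: str, country: str,
                  official_lists: Dict[str, Set[str]]) -> Optional[str]:
    """Return "Y", "N", or None when the name is on no official list at all."""
    if name in official_lists.get(country, set()):
        return "Y"  # official in the country where it was seen
    if any(name in names for names in official_lists.values()):
        return "N"  # official elsewhere, so included but marked unofficial
    return None     # assumption: not included in the database

# Toy data for illustration only.
OFFICIAL_LISTS = {"DE": {"Jürgen", "Anna"}, "FR": {"Hélène", "Anna"}}
```

With this toy data, "Hélène" seen in Germany is flagged `"N"`, while "Jürgen" seen in Germany is flagged `"Y"`.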

*2 In our name database, the gender field indicates whether a name is typically male (M) or female (F). However, we recognize the complex nature of name-gender associations across different cultures and countries. Some names are gender-specific in one culture but gender-neutral in another. Take "Robin" for instance, which can be considered male, female, or neutral depending on the region. Furthermore, in certain cultures, surnames reveal the gender of a person, like in Russia, where last names often have distinct male and female forms. Our database captures these nuances, offering a comprehensive and culturally sensitive view of global name-gender associations.

*3 The frequency column in our database reflects the occurrence of each name on social media within specific countries. To establish this, we analyzed 22.055.118 social media profiles where we also knew the country. Each time we encountered a name, we incremented its frequency count. This approach allows us to calculate unique country-specific frequencies. For example, the name "John" shows different frequencies in the United States compared to Germany.
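The per-country counting described here amounts to a tally keyed by country and name. A minimal sketch, assuming profiles have already been reduced to (country code, name) pairs; `count_frequencies` and the sample data are hypothetical:

```python
from collections import Counter

def count_frequencies(profiles):
    """profiles: iterable of (country_code, name) pairs from usable profiles."""
    counts = Counter()
    for country, name in profiles:
        counts[(country, name)] += 1  # one increment per encountered name
    return counts

# Made-up sample, not real data.
sample = [("US", "John"), ("US", "John"), ("DE", "John"), ("DE", "Jürgen")]
frequencies = count_frequencies(sample)
```

Here "John" ends up with a count of 2 for the US and 1 for Germany, which is the kind of country-specific difference the frequency column captures.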

*4 Recognizing that name frequency often follows a non-linear distribution, we introduced the concept of country rank to provide additional context. This system sorts names by frequency within each country, assigning rank 1 to the most popular name and distributing ranks down to the least common names. By doing so, we offer a more intuitive understanding of a name's popularity within its specific cultural context. This ranking system complements the raw frequency data, giving users a comprehensive view of name popularity across different regions and cultures.
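The ranking step can be sketched as a sort within each country. The function `country_ranks` is hypothetical, and breaking ties alphabetically is an assumption, since the text does not say how names with equal frequency are ordered.

```python
from collections import defaultdict

def country_ranks(frequencies):
    """frequencies: dict mapping (country, name) -> count. Returns ranks from 1."""
    by_country = defaultdict(list)
    for (country, name), count in frequencies.items():
        by_country[country].append((name, count))

    ranks = {}
    for country, entries in by_country.items():
        # Highest frequency first; ties broken alphabetically (an assumption).
        entries.sort(key=lambda entry: (-entry[1], entry[0]))
        for position, (name, _count) in enumerate(entries, start=1):
            ranks[(country, name)] = position
    return ranks
```

Because ranks are computed per country, a name can be rank 1 in one country and far down the list in another, regardless of its raw frequency.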

Use cases

Name Census stands as the world's most comprehensive downloadable name database, containing 1.507.690 validated first names and 3.251.185 validated surnames from 139 countries. Our database's availability in CSV, SQL, and JSON formats ensures seamless integration with any system. Explore below how our extensive name database can revolutionize your applications and user experiences.

Intelligent form auto-completion

Enhance your website's user experience by implementing smart name auto-completion in your forms. Leveraging our robust name database, coupled with a memory table and JavaScript, you can create a Google-like auto-suggest feature for name fields. As users type, the system presents the most popular name options, significantly reducing input errors and automatically inferring gender information. This smart feature not only streamlines the user experience but also enhances data accuracy and completeness.

  • Streamline contact, registration, and order forms for improved user engagement
  • Boost conversion rates by minimizing form abandonment
  • Enrich user profiles with inferred data on gender and potential nationality
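The lookup behind such an auto-suggest field could be sketched on the server side as below. The class `NameSuggester` and its row layout (name, gender, frequency) are hypothetical; an in-memory sorted list with binary search stands in for the memory table, and the browser-side JavaScript is left out.

```python
import bisect

class NameSuggester:
    """Prefix lookup over (name, gender, frequency) rows held in memory."""

    def __init__(self, rows):
        # Sort once by case-folded name so prefixes form a contiguous slice.
        self._rows = sorted(rows, key=lambda row: row[0].casefold())
        self._keys = [row[0].casefold() for row in self._rows]

    def suggest(self, prefix, limit=5):
        key = prefix.casefold()
        if not key:
            return []
        start = bisect.bisect_left(self._keys, key)
        # Upper bound: the prefix followed by the highest code point.
        end = bisect.bisect_right(self._keys, key + "\U0010ffff")
        matches = self._rows[start:end]
        matches.sort(key=lambda row: -row[2])  # most frequent first
        return matches[:limit]
```

Each returned row carries the gender value alongside the name, so a form could prefill or infer gender once the user picks a suggestion.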

Advanced local name parsing

For websites handling user signups, orders, or contact form submissions, our sophisticated name parsing capabilities offer invaluable benefits. While our name parsing software provides a ready-to-use solution, you can also acquire our comprehensive name list to develop a customized, local name parsing system. This approach ensures name validity, detects fabricated names, and identifies potential misspellings, all while maintaining complete control over your data processing.

  • Accurately parse full names into parts (first name, surname, nickname, etc.)
  • Implement robust name verification to maintain data integrity
  • Extract valuable insights such as likely gender and nationality from submitted names
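A customized local check built on the name lists might look like the sketch below. It is a simplified illustration, not the Name Census parsing software: `FIRST_NAMES`, `SURNAMES`, and `check_name` are hypothetical, the data is a tiny sample, and misspellings are detected with the standard library's `difflib.get_close_matches`.

```python
import difflib

# Tiny illustrative lookups; a real system would load the full database.
FIRST_NAMES = {"anna": "F", "jürgen": "M", "hélène": "F"}
SURNAMES = {"genç", "bhardwaj", "nuñez"}

def check_name(full_name):
    """Validate a submitted name against the lookups and suggest corrections."""
    parts = full_name.split()
    if len(parts) < 2:
        return {"valid": False, "reason": "incomplete name"}
    first, surname = parts[0].casefold(), parts[-1].casefold()
    result = {
        "first_known": first in FIRST_NAMES,
        "surname_known": surname in SURNAMES,
        "gender": FIRST_NAMES.get(first),  # None when the first name is unknown
        "suggestion": None,
    }
    if not result["first_known"]:
        # Look for a likely misspelling of a known first name.
        close = difflib.get_close_matches(first, list(FIRST_NAMES), n=1, cutoff=0.8)
        result["suggestion"] = close[0] if close else None
    result["valid"] = result["first_known"] and result["surname_known"]
    return result
```

An unknown first name with a close match (for example "Ana" against "anna") is reported as a probable misspelling rather than silently rejected, which keeps the decision with the application.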

Changelog

We continuously improve our databases, software, and website. This changelog gives a brief overview of the changes and progress we have made.

Date Changes
2024-07-06
  • Improved order flow and payment using Stripe.
  • Databases can now be downloaded per country or as a single database.
  • Database links are sent securely via email for different database formats.
2023-03-13
  • Added SQL and JSON file formats in addition to CSV.
  • Improved the content of all pages to make the name database easier to find.
  • Improved the order process.
  • Implemented reCAPTCHA on every page that uses a form to block bots.
2023-02-27
  • Updated first name database containing 1.507.690 names from 139 countries.
  • Started exporting the surname database and adding it to our service.
  • Surname database created containing 3.251.185 names from 139 countries.
2022-10-23
  • Updated first name database containing 1.467.445 names from 141 countries.
2021-01-30
  • Improved the delivery of databases via visible link in confirmation email.
  • Updated first name database containing 995.718 names from 113 countries.
2020-08-31
  • Launched first version of the website offering official name lists.
  • Added Stripe as a payment method to offer an easy payment flow.
  • Updated first name database containing 859.257 names from 104 countries.