The Genomic Formation of South and Central Asia – Some thoughts, Part 1

While we await a new preprint (or the final version) of Narasimhan et al. 2018, I’d like to comment on several random things related to it.

Bactria-Margiana Archaeological Complex (BMAC)

When Victor Sarianidi first excavated the BMAC sites, his hypothesis was that they represented a migration from Syro-Anatolia to the east. As more information about the earlier cultures of the area became available over the following decades, the debate about its origin reopened, with a more native development as an option. We still don’t have DNA from the earliest Neolithic times, but we do have samples from the 5th-4th mill. to compare to the later BA ones from the BMAC proper.

If we assume that the samples from Sarazm (c. 3500 BCE) represent a more “native” (but we don’t know if Mesolithic or only Neolithic) type, we can use them to first look at the other Eneolithic samples. Geoksiur_Eneolithic are loosely dated to 5000-2000 BCE, but they’re more “eastern” than the other 2 groups, so probably they are a bit older:

Sarazm_Eneolithic:I4290    42.4%
Ganj_Dareh_N    28.8%
Sarazm_Eneolithic:I4910    14.8%
Hajji_Firuz_ChL    12%
West_Siberia_N    2%
Seh_Gabi_ChL    0%

Distance 1.7415%

Then we have the Anau and the Parkhai samples, from the 4th mill.

Ganj_Dareh_N    42.8%
Sarazm_Eneolithic:I4290    33.2%
Seh_Gabi_ChL    13.8%
Sarazm_Eneolithic:I4910    6.2%
Hajji_Firuz_ChL    4%
West_Siberia_N    0%

Distance 2.2718%

Ganj_Dareh_N    35.8%
Sarazm_Eneolithic:I4910    29.4%
Seh_Gabi_ChL    21.4%
Sarazm_Eneolithic:I4290    13.4%
West_Siberia_N    0%
Hajji_Firuz_ChL    0%

Distance 2.1556%

So now let’s compare these to the later BMAC proper samples:

Seh_Gabi_ChL    39%
Geoksiur_Eneolithic    27%
Sarazm_Eneolithic:I4290    18.6%
Shahr_I_Sokhta_BA2    14%
West_Siberia_N    1.4%

Distance 1.1578%

Geoksiur_Eneolithic    36.6%
Seh_Gabi_ChL    33.8%
Shahr_I_Sokhta_BA2    15%
Sarazm_Eneolithic:I4290    12.2%
West_Siberia_N    2.4%

Distance 1.0491%

So what we see is that the BMAC did receive a good amount of migration from the west between the Eneolithic and the Bronze Age (34-39% from around West Iran) and a smaller amount of migration from the Indus Valley (~15%), but the native component is still the largest. This probably agrees with more recent archaeology, which has established intensive cultural contacts between these 3 regions from the Eneolithic onwards, with cultural exchange probably happening in every direction.
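For readers who wonder how percentages and a "Distance" like the ones above can be obtained, here is a minimal sketch. It assumes that each population is a vector of coordinates (as in the Global 25 datasheets), that a model is the set of non-negative source weights summing to 100% whose weighted average lies closest to the target, and that "Distance" is the Euclidean distance between the target and that fitted mixture, shown multiplied by 100. The function names, the projected-gradient fitting method and the toy coordinates are all made up for illustration; this is not necessarily the tool or method used for the models in this post.

```python
import math


def project_to_simplex(values):
    """Return the closest point to `values` with entries >= 0 that sum to 1."""
    ordered = sorted(values, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / count
        if value - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in values]


def fit_mixture(target, sources, steps=5000):
    """Fit mixture weights by projected gradient descent (illustrative method)."""
    names = list(sources)
    dims = range(len(target))
    # Safe step size: the sum of squared source norms bounds the curvature.
    step_size = 1.0 / sum(c * c for n in names for c in sources[n])
    weights = [1.0 / len(names)] * len(names)

    def mixture(current):
        return [sum(w * sources[n][d] for w, n in zip(current, names)) for d in dims]

    for _ in range(steps):
        residual = [m - t for m, t in zip(mixture(weights), target)]
        gradient = [sum(r * c for r, c in zip(residual, sources[n])) for n in names]
        weights = project_to_simplex(
            [w - step_size * g for w, g in zip(weights, gradient)]
        )

    final = mixture(weights)
    distance = math.sqrt(sum((m - t) ** 2 for m, t in zip(final, target)))
    return dict(zip(names, weights)), distance


# Toy coordinates, invented for this example (NOT real Global 25 values).
toy_sources = {
    "Source_A": [0.10, 0.02, 0.00],
    "Source_B": [0.02, 0.12, 0.01],
    "Source_C": [-0.05, 0.00, 0.09],
}
# The toy target is built as exactly 60% Source_A + 40% Source_B.
toy_target = [
    0.6 * a + 0.4 * b
    for a, b in zip(toy_sources["Source_A"], toy_sources["Source_B"])
]

weights, distance = fit_mixture(toy_target, toy_sources)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name}    {100 * weight:.1f}%")
print(f"Distance {100 * distance:.4f}%")
```

Under these assumptions, a source that contributes nothing simply ends up with a weight of 0%, as in several of the models above, and a lower distance means that the chosen sources reproduce the target more closely.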


The Scythians and the language of the steppe

One interesting but little-discussed thing in the paper is that it better documents the genesis of the Scythians in Central Asia. We already knew that the Scythians were largely descended from Sintashta/Andronovo people, but also had some NE Asian (much more in the Eastern Scythians than in the Western ones) and some “southern” admixture. In this paper we might have references for those sources and early (presumably) proto-Scythian samples.

Taking a look at the Western Scythian/Sarmatian samples using Eurogenes Global 25 datasheets, we see that they are closest to samples from Kazakhstan c. 1600-1500 BCE like Taldysay_MLBA2 (ID I4794, 1600-1400 BCE, Y-DNA J2a1h2, same as Tepe_Hissar_ChL:I2337, Iran, 3641-3519 calBCE) or Kyzlbulak_MLBA2 (ID I4784, 1618-1513 calBCE, Y-DNA Q1a2b2). These two samples can be modelled as:

Sintashta_MLBA    58.4%
Parkhai_EBA    16.4%
West_Siberia_N    10.4%
Tepe_Hissar_ChL    7.6%
Xibo    7.2%

Distance 2.9151%

Sintashta_MLBA    47.6%
Parkhai_EBA    27.3%
West_Siberia_N    24.7%
Xibo    0.4%
Tepe_Hissar_ChL    0%

Distance 2.2969%

(Here I should add that the reason why Parkhai_EBA works better than, say, Gonur1_BA is probably the very low level of AASI in Parkhai_EBA, which might mean that the groups that admixed with the Andronovo people had less AASI than those within bigger urban centers, or there may be some other geographical reason).

In turn, Sarmatians can be modelled mostly as a mix of the above samples and the Srubnaya people they encountered on their migration to the western steppe:

Kyzlbulak_MLBA2    43.1%
Srubnaya_MLBA    40.2%
Taldysay_MLBA2    7.4%
Xibo    6.1%
Armenia_EBA    3.2%

Distance 1.8709%

Or using the same sources as above:

Sintashta_MLBA    67.8%
Parkhai_EBA    15.8%
West_Siberia_N    9.2%
Xibo    7.2%
Tepe_Hissar_ChL    0%

Distance 1.9752%

The interesting thing about this genetic data is that it continues to provide more information for linguists to work with. Why? Because we can now more accurately tell the place and time when specific contacts or migrations happened, and that provides valuable information for linguistic research. To the point: we don’t know exactly how Indo-Iranian languages arrived in SC Asia, but we do know quite accurately that they were spoken there from at least 1800 BCE. If, for the sake of simplicity and to avoid controversies, we follow the steppe hypothesis, Indo-Iranian formed right there through the migration of Andronovo tribes, in the contact zone with BMAC. David Anthony, following Lubotsky, refers to 55 non-Indo-Iranian words borrowed into common Indo-Iranian, among them the words for bread, ploughshare, canal, brick, camel, ass, sacrificing priest, soma and Indra, and concludes:

The BMAC fortresses and cities are an excellent source for the vocabulary related to irrigation agriculture, bricks, camels and donkeys; and the phonology of the religious terms is the same, so probably came from the same source.¹

It’s really not the purpose here to debate whether this is right or not. It’s only to stress that there is ample consensus about Indo-Iranian being spoken at that time, in that place and not anywhere else. With the Rigveda being composed around 1500 BCE (or earlier) in the Punjab and/or Haryana, there really is no other option.

The Scythian language is poorly known and hardly attested. However, it seems quite clear that it was an Indo-Iranian language, most likely on the Iranian branch, and more related to East Iranian. Given the date of the formation of the ancestors of the Scythians, though, we’re probably talking about a very early form of Indo-Iranian, on the Iranian branch, but still older than Avestan and close to the split with Indo-Aryan. And given that Vedic is older than Avestan, that language was more or less as close to Vedic as to Avestan, even if it was in the Iranian branch.

Knowing this is important because this language was spoken on the steppe for at least 1000 years, leaving an important substrate and presumably influencing neighbouring languages. Interesting in this respect is the relationship of this language with Balto-Slavic and Uralic. It is a task for linguists now to determine how this matches the linguistic data. Going again with the steppe hypothesis as a starting point, things would go more or less like this:

  1. A late PIE language, ancestral to both Balto-Slavic and Indo-Iranian splits (more or less coinciding with the split of R1a-Z645 into Z93 and Z283) c. 3000 BCE somewhere in Eastern Europe (western Ukraine, Poland, maybe Belarus…), with pre-Indo-Iranian moving east from there (carrying R1a-Z93) and then from Kazakhstan moving south to become proto-Indo-Iranian through contacts with BMAC.
  2. Scythians, already speaking an early form of Indo-Iranian, move west all the way to the Pontic steppe, where they arrive sometime after 1000 BCE (?). This language is spoken throughout the steppe until the arrival of Uralic and Turkic groups.

So this is the question for linguists: Are the similarities between Balto-Slavic and Indo-Iranian better explained by the first scenario alone, or by the second scenario alone, or by both, with two different layers of influence corresponding to each? And basically the same question could be asked for the relationship between proto-Uralic and the hypothetical late PIE from the 3rd mill. or the early Indo-Iranian from the late 2nd and 1st mill.

(And I should also add that while the first scenario is just hypothetical, the second one is based on known and verifiable data, so it should be taken into account in any case).

P.S.: We’ve also recently got a few samples from Alans. Genetically, they belong to the north Caucasus and are different from the Scythians/Sarmatians. So while they also spoke an East Iranian language (Ossetian being its modern descendant) probably related to that of the Scythians, it was probably not the same one.

To be continued…


1 – Anthony, D. (2007), The Horse, the Wheel, and Language.

34 thoughts on “The Genomic Formation of South and Central Asia – Some thoughts, Part 1”

  1. Amazing analysis Alberto, thank you. I have a question though, what is the explanation of Haji Firuz in the first 2 Eneolithic samples ?
    Kind Regards,


  2. @Zarzian

    Thanks. Without Mesolithic and Neolithic samples we can’t know with total certainty, but it seems that southern Central Asia was in more or less constant contact with Iran, first with an Early Neolithic wave (Ganj_Dareh_N), later a Chalcolithic one (Seh_Gabi_ChL, Hajji_Firuz_ChL) and by the Bronze Age even more Iranian (and this time also North Indian) influence.

    This is assuming that the Sarazm samples represent a native population, which could be right or wrong.

  3. Excellent job with the blog. If you have time can you try modelling a few modern day Indo-Iranian speakers with the Hissar and/or Shahr-e-Sukhteh samples?

  4. @Vara

    Thanks. I will write about that in the second part of this post in the next few days, with some models. I ran the ancient and modern samples when they became available here, in case you didn’t see these ones.

  5. “we don’t know exactly how Indo-Iranian languages arrived to SC Asia, but we do know quite accurately that they were spoken there from at least 1800 BCE”

    how do we know this? there is some evidence of IA speakers in Syria but I don’t know of anything else.

  6. @postneo

    With the evidence of the oldest texts in Indic and Iranian, their time and place of composition, is it even possible that Indo-Iranian languages were not spoken in SC Asia from at least 1800 BCE? Maybe stretching things to unrealistic limits you could push that date to a century or two later?

  7. hard dating of the vedas and Gathas is not possible. While its easy to locate them in Iran and India. They can be anywhere from 3000 to 800 bc with many interpolations.
    References to black metal etc are not hard dates. We know vedic gods were referred to in Syria.

  8. The vedic language and rituals/cults have had anomalous preservation but cannot be representative of all IA and could not have occurred in vacuum. Surely there were other IA /PIA dialects as well that died off. Given the paucity of non IE loans in vedic, at least some neighboring languages must have been IE.

  9. @postneo

    Sure, it’s difficult to have a hard date for those texts. For the Rigveda I think that 1500 BCE is more or less conservative and pushing it back further is when you can find more disagreement. There might be also someone who dates it to a significantly later period, but that’s more of an exception.

    So I guess I’m not really getting the point you’re trying to make. Taking all the data for Indic and Iranian it’s difficult not to accept that they were spoken in SC Asia at that time. It’s something we know “quite accurately”, even if we don’t have 100% certainty as if the texts were actually written at that time in some stone tablets or whatever.

    Though you are right that around 1500 an Indo-Iranian language was also spoken by the Mitanni in northern Syria.

  10. How realistic is it to expect Mitanni DNA?? Are there any burials decisively known as Mitanni elite?

  11. @Zarzian

    I think it’s going to be quite difficult to get such a thing anytime soon. I don’t know of very specific Mitanni elite burials, so we’ll probably have to make do with some generic Mitanni period burials with more or less rich goods from some sites, which will always leave some doubts whether they were Mitanni, Hurrian or some older “native” people. And with the political situation in the area I don’t think that much progress can be made in the next few years.

  12. Hi Alberto. Good blog.

    The reason why Alans are more Caucasian imho is quite simple. Because they formed in S Caucasus. From Scythians in Azerbaijan. Later they moved to Georgia and then crossed Caucasus to move into Steppe.

  13. @Aram

    Thanks. Yes, that makes sense. Also the closest modern populations in G25 are in this order: Kumyk, Kabardin, Balkar, Karachay, Cherkes, Adygei, Chechen, Azeri_Dagestan, North_Ossetian, Tabassaran, Avar.

    I want to try more models to check if the preference for Yamnaya over Sarmatians or Sintashta is persistent. If I remember correctly the Y-DNA we have from them is mostly G2a (?)

    The question is where did they get their language from. Scythians seem to be the more obvious source, but without very direct gene flow? Or is it possible that it came from somewhere else?

  14. One of the things that is necessary to relook into are the possible ancestry sources of steppe_mlba.

    I think it was Allentoft et al 2015 which had argued that considering how steppe_mlba/Sintashta had R1a-Z93, the source of Anatolian_N ancestry in them could be a more Eastern source.

    Narasimhan et al, for the 1st time had some substantial aDNA from the eastern regions. So I think it would be worthwhile to see now if steppe_mlba had received admixture from Chalcolithic or Bronze Age Central Asia. There is very clear evidence of BMAC artefacts being present at Sintashta, before the Andronovo horizon even took shape. In fact, Kuzmina had remarked, before C14 dates from BMAC, that the temples at BMAC sites such as Dzharkutan, Dashly etc were inspired from similar structures at Sintashta. Now that we know that BMAC is older than Sintashta, this similarity opens up a new dimension. It is unfortunate that the Reich team failed to look into this angle.

    Can you check it out Alberto if steppe_mlba had any Central Asian input ? I would be grateful.

  15. @Jaydeep

    Yes, Sintashta is a surprising culture in the steppe, with the fortified settlements, some advanced metalworking in each household, chariots,… And even its pottery shows BMAC influence from the start.

    But genetically they’re quite clearly “European BA”-like, and I can’t detect any other admixture in them. Adding some no-AASI samples from SC Asia:

    CWC_Baltic_early 42.9%
    Globular_Amphora 25.4%
    Poltavka 15.9%
    Yamnaya_Samara 10.8%
    CWC_Baltic 3.7%
    West_Siberia_N 1.3%
    Sarazm_Eneolithic:I4290 0%
    Geoksiur_Eneolithic:S8530.E1.L1 0%
    Tepe_Anau_Eneolithic:I4087 0%
    Trypillia 0%

    Distance 1.3008%

    It’s only after the contacts really intensified that we start to see SC Asian admixture in the steppe (from 1600 BCE onwards). I’ll be talking about the admixture going the other way around and the rest of the topics in the paper in the second part of this post. Soon, I hope!

  16. Yes sintashta is fascinating
    The non -“outliers” are even more “”European”” than CWC, due to extra MNE. Allentofts conclusions hold
    So where did they acquire their “fort” craft ? Maybe from the carpathian zone of Central Europe

  17. @Robert

    Yes, that’s a possibility. It could depend on something I was wondering a while back: Does Sintashta descend from the forest steppe cultures (Fatyanovo-Balanovo first and ultimately Abashevo)? Or are they newcomers from closer to Central Europe?

    Genetically they look like newcomers, to be honest. But we lack ancient DNA from those forest steppe cultures to really know.

  18. Alberto

    Ancient Alans had y dna Q1a2 and R1.
    G2a1 was found in SaltovoMayaki culture. Modern Ossetians are predominantly G2a1. How they got the G2a1?. Maybe in S Caucasus. From BMAC is also an option.
    Also their y dna is very drifted. And they certainly suffered a bottleneck due to Mongolian attacks.

  19. @Aram

    Thanks, now I found that the new Alan samples are being analysed and one of the 3 males (DA243) belongs to R1a-Z2124+ like the Scythians. The other 2 are Q. I’ll have to check the samples in more detail.

  20. The earliest dates for spoken IAr in South Asia are just models. We have 1400 BC for IAr like words in syria which partially helps in establishing a date(we can give a century at least for people to migrate from a native land). The speakers appear to be a small foreign minority so it does not help localize the source. theres an earlier 1800 BC Mitanni Aryan name but thats a single word. Taken together this points at spoken vedic being pushed back to at least 1500 or perhaps 1900 BC. More on this later.

  21. One of the talks at ISBA 2018:

    The Steppe was sown – multi-isotopic research changes our understandings of Scythian diet and mobility
    Alicia R. Ventresca Miller (Jena/DE)

    Might be confirming what we knew from dental inspection: that steppe pastoralists adopted a mixed (animals/grains) diet after the contacts with BMAC.

    Also somehow related is the confirmation that broomcorn millet spread from China:

    Modern and ancient DNA evidence for the origins and spread of broomcorn millet (Panicum miliaceum) from China
    Harriet Hunt (Cambridge/GB)

    This was mediated by those Central Asian nomads from Tasbas, Begash, etc…, who pioneered what would later become the Silk Road, and from whom we have a sample in this paper too: Dali_EBA (I3447).

  22. The similarities between I-A and balto-Slavic are clear. However, when are the earliest attestation of Scythian language ?

  23. Scythian languages are very poorly attested. Our knowledge of them mostly comes from toponyms, tribal and personal names, gods’ names,… but from third-party references, which complicates their secure classification even more.

    It’s almost surely Indo-Iranian, and usually classified as East-Iranian. Now with ancient DNA we can more or less establish the time and place of the Scythian ethnogenesis, so that gives some more information about their linguistic affiliation, as mentioned in the post.

  24. I have probed indologists and IE linguists over some time.
    Mainstream Academics all make a standard statement that.. vedic was spoken in the panjab around 1500 bc but its lip service. All their discussions and models actually assume vedic as a steppe language with very late transference of steppe terms to South Asia.

    So why do they even bother saying Vedic was spoken in India, Classical sanskrit the direct descendent is only attested from 400-500 bc. Why not put everything prior to that in the steppe?

    It’s only because they find pristine oral transmission in linguistically and geographically disparate populations with preservation down to each syllable and no loans or substrate influence. They have no explanation for it and are somewhat aghast. While this phenomenon does not allow hard dates they make some fuzzy allowance for a deep spread and anomalous preservation.

  25. I’m not saying that such views have great validity … they can neither be proved or disproved. Rather its more a social anthropological observation of mainstream academia.

    Vedic would never have been considered an Indic language except for that minor detail that it has been found in India.

    we also have the non academic children of the corn types who should be totally ignored.

  26. Speaking of “children of the corn”, over at eurogenes they’ve raised an interesting point. The “Yamnaya outlier” female, she might show Maikop admixture
    Although Matt misunderstands the importance and attribution of this “outlier”, I think the link she represents- and the distribution of people like her – will be very important
    I think even Davidski realises this, which is why he seems to have a constant bee in his bonnet about Hegarty (a great linguist who works closely with the max Planck Lab team).
    Ironically named, to compensate how wrong he was about South Asia 🙂

  27. Yes. Everyone without exception has turn into South caucasus. All groups, all labs, all major players. So it needs to be something before 4th millennium. It can only mean that they have Shulaveri shomu adna.

    If now at eurogenes there are references of koros HG and Whg…. I was the guy that kept saying Shulaveri comes from southeast europe by south shores black sea … After all their arrogance they really had it wrong in too many too much. No lost love for Davidski here.

  28. @Robert

    Yes, that outlier can give some clues about the interactions going on. But in the end I don’t think that gene flow is the only thing that we should look at in this case. As you have mentioned other times, the context, the nature of the interactions, the cultural influences, etc… are what really matters.

    In general, I think that these Eneolithic_Steppe samples from the North Caucasus steppe don’t have any recent admixture from south of the Caucasus, and mostly the same goes for Yamnaya. But that does not exclude language transmission if there was a big cultural transmission. So that’s something to look at closely to understand what might have happened (in this and in every other case).

    That said, I’m far from sold on the idea that Yamnaya spoke an IE language, so it might not even matter where they got their language from, at least for the IE question. I still think that for unravelling that mystery, the early attested languages are the ones that matter: Greek, Hittite and Sanskrit. To a lesser degree, Avestan, Celtic and Latin. The rest, including Tocharian, Balto-Slavic, Albanian or Armenian are not going to be helpful, but rather distract from the real point.


    If S-S really came from SE Europe that could explain a few things regarding Armenia_ChL. Here’s hope that we’ll get samples soon.

  29. @postneo

    I’m also rather sceptical about linguistics when it comes to such deep time spans, but there are some things that can be verified as evidence. And anyway I don’t think you’re arguing that Indo-Iranian languages were not spoken in SC Asia by 1800 BCE, but rather questioning how sure we can be about it?

    I honestly think we can be pretty sure, but if someone can present convincing arguments against it I will reconsider this position. I just don’t think it’s controversial enough for anyone to argue against it.

    In any case, more on this soon. Trying to finish the second part about SC Asia still…

  30. @Alberto

    Thanks. Still waiting for part 2 though.


    You pretty much nailed it. There seems to be incompetence or just straight up dishonesty when dealing with the subject of the Indo-Iranians.

    The non-existent dog sacrifice in Anthony’s personal version of the Rigveda, the vague description of the houses of Anahita that only matches with Andronovo houses according to Kuzmina, and of course the nitpicking of certain Rigvedic verses (fort-destroyer related) to show that Rigvedic people were Andronovo nomads, are just to prove a preconceived idea of who the Aryans were while ignoring everything that contradicts it. There is no dog sacrifice in the Rigveda, Anahita is a 5th century BCE Mesopotamian inspired goddess so any link to Andronovo is ridiculous, and the cereal farming and prayers to be preserved in forts that are in the actual Rigveda are of course ignored.

  31. @ Alberto

    “I’m far from sold on the idea that Yamnaya spoke an IE language,”

    Neither am I. The evidence so far argues against Yamnaya having much to do with PIE, at least as far as Caucasus-Caspian spectrum of Yamnaya. Yamnaya was but one of many expanding systems & strategies during the chalcolithic revolution, & probably represents recently agro-pastoralised north Caucasus communities expanding over EHGs and SiberianHGs, speaking extinct or Paleo-Basque languages. We now know that Afanasievo became extinct so the (always tenuous) link between it and “Tocharians” can be finally laid to rest.

  32. @Alberto

    Nice blog! Good post. Here are some thoughts (haven’t read all the comments yet if I am redundant).

    1 – Scythian is a very large term, and prone to confusion. Scythian as a language is unresolved. V.I. Abaev, one of the earliest to make the links to E-Iranian, has his detractors as well. In other words, some don’t think Scythian was an Iranian language. Personally, I think the closer picture is that the definition of Scythians is too imprecise. Some Scythians were Indo-Iranians, and some were non-IE speaking.

    2 – David Anthony doesn’t know anything about linguistics. He just relies on linguists around him to cherry pick his data. If he was honest, he’d see what the data he has published in his own books is actually saying. Exhibit A, this garbage: “55 non-Indo-Iranian words borrowed into common Indo-Iranian, among them the words for bread, ploughshare, canal, brick, camel, ass, sacrificing priest, soma and Indra.” This is so laughable, not sure where to begin. That “Soma”, “Priest”, and “Indra” are not Indo-Iranian in origin? Anthony is a clown, a very boring clown.

    3 – Scenario 2 in your two scenarios is the answer. 100%. And this, because of the oldest words we do know from “Scythian.”

  33. @Vara & Daemon Starfyre

    Yes, I agree that David Anthony does not have a good grasp on linguistics. As an example, he argues that PIE developed in an area between Uralic and Kartvelian languages (due to the supposed influence of them in PIE) and then he goes on to say:

    These distinctions [between Maikop and the steppe] persisted in spite of significant cross-frontier interaction. When Maikop traders came to Konstantinovka, they probably needed a translator.

    This is literally not how language contact/influence happens. With traders being the main mediators and with the help of a translator? Really? That’s how PIE got the argued Kartvelian substrate?

    @Daemon Starfyre

    Yes, I would really go for the second scenario in that linguistic problem. The first one, with pre-Balto-Slavic and pre-Indo-Iranian splitting c. 3000 BCE in central-eastern Europe, and then pre-Indo-Iranian going all the way to Iran and India would result in Balto-Slavic being very close to Germanic, Celtic or Italic, while Indo-Iranian being a very divergent branch. It simply doesn’t work.

  34. @Robert

    Yes, I’ll try to summarise soon the current situation about R1b and Western Europe. Hopefully it will help to put an end to the narrow view that Basques are an exception (they are today, but that’s irrelevant, it’s the historical situation that matters), among other things.

