Improve client identification based on BEP 10

Review Request #130077 - Created April 10, 2017 and updated

Information
Fonic Maxxim
libktorrent
master
96d2adf...
Reviewers
ktorrent
aacid

Identify clients more accurately using the 'v' string embedded in extended handshake messages as described by BitTorrent Enhancement Proposal 10 'Extension Protocol' (BEP 10).

My tests showed that:
- most peers/clients send the string
- most peers/clients send useful/compliant string content
- clients are identified even if PeerID::identifyClient() fails to do so (e.g. 'Unknown client' -> 'Tixati')
- client identification becomes more accurate (e.g. 'Azureus 5.7.x.x' -> 'Vuze 5.7.x.x')

Here's a screenshot comparing identification methods. Format of column 'Client': v-string-based identification, peer-id-based identification, peer id.

yes

Fonic Maxxim
Albert Astals Cid
Fonic Maxxim
Fonic Maxxim
Review request changed

Change Summary:

Added screenshot

Description:

   

Identify clients more accurately using the 'v' string embedded in extended handshake messages as described by BitTorrent Enhancement Proposal 10 'Extension Protocol' (BEP 10).

   
   

My tests showed that:

    - most peers/clients send the string
    - most peers/clients send useful/compliant string content
    - clients are identified even if PeerID::identifyClient() fails to do so (e.g. 'Unknown client' -> 'Tixati')
    - client identification becomes more accurate (e.g. 'Azureus 5.7.x.x' -> 'Vuze 5.7.x.x')

  +
  +

Here's a screenshot comparing identification methods. Format of column 'Client': v-string-based identification, peer-id-based identification, peer id.

Alexander Trufanov

   
src/peer/peer.cpp (Diff revision 1)
 
 

How about to check if data() is not empty before overwrite stats.client? I suppose some BT clients may provide end user control over this field and it may be erased.

  1. I think data() will always contain something due to the previous if-clause, but I could check if data().toString() is an empty string before overwriting stats.client. I think that's what you meant, right?

  2. Like this:

            if ((val = dict->getValue(QByteArrayLiteral("v"))))
            {
                QString peerid = val->data().toString();
                if (!peerid.isEmpty())
                {
                    stats.client = val->data().toString();
                }
            }
    
  3. exactly

Alexander Trufanov

I've collected some statistics and it seems few torrent client violate this BEP and send client name in other than UTF-8 encoding.
This results in something like:

?Torrent 1.5(ĀµTorrent 1.6.0.0)
-XL0012--x???Awt???4(Xunlei 0.0.1.2)

It would be better to detect ? (hex BDBFEF) which Qt seems to use to substitute wrongly encoded UTF-8 characters and ignore value in this case. So we need something like

                if (!new_client.isEmpty() && !new_client.contains(QString(QChar::ReplacementCharacter)))
                    stats.client = new_client + "(" + stats.client  + ")";

P.S.: The trick with QChar::ReplacementCharacter is taken from here.
The list of tracked peer ids is here.

  1. Thank you for your testing and input on this. I alread encountered this during my own tests.

    Fortunately, only very few clients behave incorrectly, e.g. '-XL0012-% ??O??1?p (Xunlei 0.0.1.2)', a Chinese p2p app (no offense intended).

    I think I'll do it this way:
    - verify string is non-empty
    - verify string does not equal client id (would filter e.g. '-XL0012-% ??O??1?p')
    - verify string does not contain invalid characters (would filter e.g. '?Torrent 1.5')
    - parse string using regex

    regex:
    {client}{separator}{version} -> (.+)[[:space:]/]+([a-zA-Z0-9.]+) -> new_client = client + " " + version

    This should give nice and clean results, e.g.
    - 'BaiduNetdisk/2.1.2.14' -> 'BaiduNetdisk 2.1.2.14'
    - 'libtorrent/1.0.7.0' -> 'libtorrent 1.0.7.0'
    - '7.10.33.358' -> ignored

    In your list, I noticed an error:
    'MLDonkey 3...1..'

    As there are no parentheses, this is an error by peerid.cpp, not by my code. We should look into that.

  2. Just to clarify:
    'verify string does not equal client id' means 'new_client != this->getPeerID().toString()'

  3. Here's my updated testing code so far, would be great if you could do some testing yourself:

            if ((val = dict->getValue(QByteArrayLiteral("v"))))
            {
                QString vstrid = val->data().toString();
                QString vstrid2("no match");
                if (!vstrid.isEmpty() && !vstrid.contains(QChar::ReplacementCharacter))
                {
                    //stats.client = stats.client + " (" + vstrid + ")";
                    //stats.client = vstrid;
                    QRegularExpression re("^(.+)[[:space:]/]{1}[v]?([^+]+)[+]?$");
                    QRegularExpressionMatch match = re.match(vstrid);
                    if (match.hasMatch())
                    {
                        vstrid2 = match.captured(1) + " " + match.captured(2);
                        stats.client = vstrid2;
                    }                    
                }
                Out(SYS_CON | LOG_DEBUG) << QString("%1 %2 %3").arg("+++", -10).arg(vstrid, -30).arg(vstrid2, -30) << endl;
            }
    
  4. I've tested it and collected some info in form: old_stats.client (vstrid) (vstrid2).
    The results are here: https://pastebin.com/9iHiXc7S
    It mostly fine. Few cases I may highlight are:

    Halite 0.4.0.1 (Halite v 0.4.0.1) (Halite v 0.4.0.1)
    libtorrent 0.E.0.0 (Torrent2exe/2.0.120 libtorrent/0.14) (Torrent2exe/2.0.120 libtorrent 0.14)
    Transmission 2.9.2.0 (Transmission 2.92+) (Transmission 2.92)
    

    Not sure if 2nd need to be fixed.

    Note: I was testing with latest libktorent. There was a commit few days ago that seems to be aimed to improve Azureus and Tixati clients detection.
    https://github.com/KDE/libktorrent/commit/15d9bf56a67845d7b518cf993d13ae25a864b078
    You better git pull it if you didn't do this already.

Loading...