मेरा डेटा फ़ोलोइंग की तरह है

1,A-1-LYS-N,0.0704,0.1339,0.1733
9,A-1-LYS-NZ,0.0514,0.0543,0.1181
2,A-1-LYS-CA,0.0418,0.0664,0.125
7,A-1-LYS-CD,0.0406,0.0777,0.119
8,A-1-LYS-CE,0.0405,0.078,0.1206
4,A-1-LYS-O,0.0377,0.037,0.0903
6,A-1-LYS-CG,0.0373,0.0775,0.1472
5,A-1-LYS-CB,0.0233,0.0522,0.1103
3,A-1-LYS-C,0.0136,0.0256,0.0832

13,A-2-VAL-O,0.0471,0.0662,0.0647
12,A-2-VAL-C,0.0453,0.0681,0.1098
10,A-2-VAL-N,0.038,0.055,0.0657
15,A-2-VAL-CG1,0.0274,0.0435,0.0459
11,A-2-VAL-CA,0.0253,0.0485,0.0658
14,A-2-VAL-CB,0.0159,0.0049,0.0064
16,A-2-VAL-CG2,0.0149,0.0333,0.0604

27,A-3-PHE-CZ,0.0644,0.061,0.1214
26,A-3-PHE-CE2,0.0471,0.0449,0.0915
25,A-3-PHE-CE1,0.0458,0.0509,0.0653
22,A-3-PHE-CG,0.0343,0.0515,0.0878
19,A-3-PHE-C,0.0338,0.0772,0.0853
20,A-3-PHE-O,0.0269,0.0566,0.0515
24,A-3-PHE-CD2,0.025,0.0295,0.0785
18,A-3-PHE-CA,0.0244,0.0557,0.0637
21,A-3-PHE-CB,0.0225,0.0366,0.0369
17,A-3-PHE-N,0.0135,0.056,0.0997
23,A-3-PHE-CD1,0.0049,0.0204,0.0101

यह A-LEU-129 तक चलता है। मैं डेटा के प्रति समूह कॉलम 3, 4 और 5 के औसत की गणना करना चाहता हूं। डेटा के प्रत्येक समूह को एक रिक्त रेखा से अलग किया जाता है। मैं इसे awk के साथ कैसे करूं? आउटपुट इस तरह की दूसरी फाइल में होना चाहिए

A-1-LYS: 0.039, 0.066, 0.120
A-2-VAL: 0.030, 0.045, 0.059
A-3-PHE: 0.031, 0.049, 0.071

या अधिमानतः

1 0.039 A-1-LYS
2 0.066 A-1-LYS
3 0.120 A-1-LYS
1 0.030 A-2-VAL
2 0.045 A-2-VAL
3 0.059 A-2-VAL
1 0.031 A-3-PHE
2 0.049 A-3-PHE
3 0.071 A-3-PHE

शुक्रिया

मुझे पता है कि awk प्रति कॉलम औसत की गणना करता है, लेकिन मुझे यकीन नहीं है कि यह awk इसे संभाल सकता है या शायद R का उपयोग करना बेहतर है।

r awk
0
murpholinox 13 अगस्त 2019, 20:00

2 जवाब

सबसे बढ़िया उत्तर
#! /bin/bash

awk -F,  '{
    if($0 == "") {
        --NR
        if (group == "" || flag == 1){ next }
        printf "%s: %0.3f, %0.3f, %0.3f\n", group, sum1/NR, sum2/NR, sum3/NR 
        NR=0
        sum1=sum2=sum3=0
        flag=1
        next
    }
    if (NR == 1) {
        flag=sum1=sum2=sum3=0
        split($2, arr, "-")
        group=arr[1]"-"arr[2]"-"arr[3]
    }
    sum1+=$3
    sum2+=$4
    sum3+=$5    
} END {if(flag==0){--NR ; printf "%s: %0.3f, %0.3f, %0.3f\n", group, sum1/NR, sum2/NR, sum3/NR}}' file

अन्य पसंदीदा प्रारूप:

#! /bin/bash

awk -F, '{
    if($0 == "") {
        --NR
        if (group == "" || flag == 1){ next }
        print "1", sum1/NR, group
        print "2", sum2/NR, group
        print "3", sum3/NR, group
        NR=0
        sum1=sum2=sum3=0
        flag=1
        next
    }
    if (NR == 1) {
        flag=0
        sum1=0;sum2=0;sum3=0
        split($2, arr, "-")
        group=arr[1]"-"arr[2]"-"arr[3]
    }
    sum1+=$3
    sum2+=$4
    sum3+=$5    
} END {if(flag==0){NR=NR-1 ; print group, sum1/NR, sum2/NR, sum3/NR}}' file

यदि आप चाहते हैं कि संख्याएँ केवल 3 दशमलव स्थानों तक प्रदर्शित हों,

#! /bin/bash

awk -F, '{
    if($0 == "") {
        --NR
        if (group == "" || flag == 1){ next }
        print "1", substr(sum1/NR, 0, 5), group
        print "2", substr(sum2/NR, 0, 5), group
        print "3", substr(sum3/NR, 0, 5), group
        NR=0
        sum1=sum2=sum3=0
        flag=1
        next
    }
    if (NR == 1) {
        flag=0
        sum1=0;sum2=0;sum3=0
        split($2, arr, "-")
        group=arr[1]"-"arr[2]"-"arr[3]
    }
    sum1+=$3
    sum2+=$4
    sum3+=$5    
} END {if(flag==0){NR=NR-1 ; print group, sum1/NR, sum2/NR, sum3/NR}}' file

इसे आज़माएं, अगर आप आश्वस्त कर सकते हैं कि आपकी फ़ाइल के अंत में एक रिक्त रेखा है, तो आपको END ब्लॉक और flag की आवश्यकता नहीं होगी।

मुझे R के बारे में कोई जानकारी नहीं है, इसलिए पता नहीं अगर R बेहतर होगा।

3
Mihir 13 अगस्त 2019, 22:32

यह समाधान आर में काम करता है।

# Create dataframe
df <- data.frame(
    x1 = c(1, 9, 27, 26),
    x2 = c('A-1-LYS-N', 'A-1-LYS-NZ', 'A-LEU-129-CZ', 'A-LEU-129-CE2'),
    x3 = c(0.0704, 0.0514, 0.0644, 0.0471),
    x4 = c(0.1339, 0.0543, 0.061, 0.0449),
    x5 = c(0.1733, 0.1181, 0.1214, 0.0915)
)

# Create groupings
library(stringr);
df$grouping <- word(gsub('-',' ',df$x2),1,3); df$grouping <- gsub(' ','-',df$grouping)

# Calculate means by groupings
library(dplyr)
summarize(group_by(df, grouping), mean(x3), mean(x4), mean(x5), n())
2
Monk 13 अगस्त 2019, 21:12